Datasets

The models in malaysian-manglish-nlp were trained on curated Malaysian text datasets.

All datasets are available on Hugging Face: vexccz/manglish-nlp-dataset.

Multi-Task Dataset (v3.2.0)

14,384 labeled examples with sentiment, emotion, and intent annotations.

File	Samples	Description
`manglish_14384.jsonl`	14,384	Full dataset (original + augmented)
`manglish_7884.jsonl`	7,884	Previous version (v3.1.0)
`manglish_labeled.jsonl`	561	Original v1 manually labeled
`manglish_labeled_v2.jsonl`	578	v2 rebalanced
`manglish_full.jsonl`	2,400+	Combined all versions

Labels

Sentiment (3 classes): positive, negative, neutral

Emotion (8 classes): happy, sad, angry, fear, surprise, disgust, love, neutral

Intent (8 classes): question, statement, request, complaint, greeting, opinion, command, offer
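For training a classifier, the string labels usually need to become integer ids. The sketch below builds one label-to-id table per task from the three lists above. The id order simply follows the order in which the labels are listed here, which is an assumption and may not match the ids used by the released models; `encode` and `decode` are illustrative helper names.

```python
# Label inventories copied from the lists above. The ordering (and so the
# integer ids) is an assumption, not necessarily what the released models use.
SENTIMENT = ["positive", "negative", "neutral"]
EMOTION = ["happy", "sad", "angry", "fear", "surprise", "disgust", "love", "neutral"]
INTENT = ["question", "statement", "request", "complaint",
          "greeting", "opinion", "command", "offer"]

TASKS = {"sentiment": SENTIMENT, "emotion": EMOTION, "intent": INTENT}

# One label -> id table per task.
LABEL2ID = {
    task: {label: i for i, label in enumerate(labels)}
    for task, labels in TASKS.items()
}


def encode(record: dict) -> dict:
    """Map the three string labels of a record to integer ids."""
    return {task: LABEL2ID[task][record[task]] for task in TASKS}


def decode(ids: dict) -> dict:
    """Inverse of encode: map integer ids back to label strings."""
    return {task: TASKS[task][ids[task]] for task in TASKS}


example = {"sentiment": "positive", "emotion": "happy", "intent": "opinion"}
print(encode(example))
# {'sentiment': 0, 'emotion': 0, 'intent': 5}
```

An unknown label raises `KeyError`, which is a cheap way to catch annotation typos early.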

Additional fields

Field	Description
`language`	Detected language (manglish, malay, english)
`dialect`	Malay dialect (standard, kelantan, terengganu, n9, kedah, sarawak, sabah)
`topic`	Topic category
`source_type`	Data source (social_media, forum, etc.)
`is_code_switch`	Whether text contains code-switching

Example entry

{
  "text": "gila best makanan dia, confirm repeat lagi",
  "sentiment": "positive",
  "emotion": "happy",
  "intent": "opinion",
  "language": "manglish",
  "dialect": "standard",
  "is_code_switch": false
}
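When adding or editing records by hand, a small check on each JSONL line can catch malformed entries. The sketch below validates the fields shown in the entry above against the values listed in the field table. `check_record` is a hypothetical helper, and it assumes every record carries `text`, `language`, `dialect`, and `is_code_switch`, which the page does not state explicitly.

```python
import json

# Allowed values taken from the field table above.
LANGUAGES = {"manglish", "malay", "english"}
DIALECTS = {"standard", "kelantan", "terengganu", "n9", "kedah", "sarawak", "sabah"}


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    record = json.loads(line)
    problems = []
    if not isinstance(record.get("text"), str) or not record["text"].strip():
        problems.append("text must be a non-empty string")
    if record.get("language") not in LANGUAGES:
        problems.append(f"unknown language: {record.get('language')!r}")
    if record.get("dialect") not in DIALECTS:
        problems.append(f"unknown dialect: {record.get('dialect')!r}")
    if not isinstance(record.get("is_code_switch"), bool):
        problems.append("is_code_switch must be a boolean")
    return problems


line = (
    '{"text": "gila best makanan dia, confirm repeat lagi", '
    '"sentiment": "positive", "emotion": "happy", "intent": "opinion", '
    '"language": "manglish", "dialect": "standard", "is_code_switch": false}'
)
print(check_record(line))
# []
```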

Data sources

Twitter/X - Malaysian users posting in Manglish
Lowyat forum posts
Malaysian news portal comments
Reddit r/malaysia
Augmented data (synonym replacement, shortform variation, back-translation)
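As an illustration of the shortform-variation augmentation named in the last item, the sketch below randomly swaps standard words for their shortforms. This is not the pipeline used to build the dataset: the four word pairs are inverted from the normalisation example elsewhere on this page, and `shortform_variant`, `rate`, and `seed` are hypothetical names chosen for the example.

```python
import random
from typing import Optional

# Standard -> shortform pairs, inverted from the normalisation example on this
# page. A real table would be far larger.
SHORTFORMS = {"nak": "nk", "tanya": "tnya", "berapa": "brp", "untuk": "utk"}


def shortform_variant(text: str, rate: float = 0.5, seed: Optional[int] = None) -> str:
    """Replace each known word with its shortform with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        short = SHORTFORMS.get(word.lower())
        # Words without a shortform entry are always kept as they are.
        if short is not None and rng.random() < rate:
            out.append(short)
        else:
            out.append(word)
    return " ".join(out)


print(shortform_variant("nak tanya berapa semester untuk graduasi", rate=1.0))
# nk tnya brp semester utk graduasi
```

Because only the surface spelling changes, an augmented copy can reuse the sentiment, emotion, and intent labels of its source record.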

Usage with Hugging Face Datasets

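from datasets import load_dataset

# Load the full multi-task file; a single data file is exposed as the "train" split
ds = load_dataset("vexccz/manglish-nlp-dataset", data_files="manglish_14384.jsonl")
print(ds["train"][0])  # first record, returned as a dict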
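A quick first check after loading is the class balance of each label field. The helper below is a minimal sketch that works on any iterable of dict rows; iterating over `ds["train"]` yields such rows, so it can be passed in directly. `label_counts` is an illustrative name, and the three rows are toy stand-ins rather than real data.

```python
from collections import Counter


def label_counts(rows, field="sentiment"):
    """Count how often each value of `field` occurs in an iterable of dict rows."""
    return Counter(row[field] for row in rows)


# Toy stand-in rows; with the real data, pass ds["train"] instead.
rows = [
    {"sentiment": "positive"},
    {"sentiment": "negative"},
    {"sentiment": "positive"},
]
print(label_counts(rows).most_common())
# [('positive', 2), ('negative', 1)]
```

The same call with `field="emotion"` or `field="intent"` gives the distribution for the other two tasks.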

Sentiment Dataset (v1.0)

Original 1,139 labeled examples for single-task sentiment classification.

Split	Samples	Positive	Negative	Neutral
Train	912	380	350	182
Test	227	95	88	44
Total	1,139	475	438	226
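The counts in the table can be checked with a few lines of arithmetic. The sketch below recomputes the row totals and the per-class shares in each split; the shares come out at roughly 42% positive, 38% negative, and 20% neutral in both train and test, differing by less than one percentage point.

```python
# Counts copied from the table above.
SPLITS = {
    "train": {"positive": 380, "negative": 350, "neutral": 182},
    "test": {"positive": 95, "negative": 88, "neutral": 44},
}


def proportions(counts: dict) -> dict:
    """Convert raw class counts to fractions of the split total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in SPLITS.items():
    shares = {label: f"{p:.1%}" for label, p in proportions(counts).items()}
    print(name, sum(counts.values()), shares)
```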

Normalisation Dictionary

638+ slang-to-standard mappings for Manglish text normalisation.

from malaysian_manglish_nlp import normalize

normalize("nk tnya brp sem utk grad")
# "nak tanya berapa semester untuk graduasi"
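Conceptually, dictionary-based normalisation is a token-by-token lookup. The sketch below shows that idea with the six mappings that can be read off the example above. It is not the library's implementation, and `normalize_tokens` is a hypothetical name used to avoid confusion with the real `normalize`.

```python
# Six mappings read off the example above; the real dictionary is described
# as having 638+ entries.
SLANG = {
    "nk": "nak",
    "tnya": "tanya",
    "brp": "berapa",
    "sem": "semester",
    "utk": "untuk",
    "grad": "graduasi",
}


def normalize_tokens(text: str, mapping: dict = SLANG) -> str:
    """Replace each whitespace-separated token that has a dictionary entry."""
    return " ".join(mapping.get(tok.lower(), tok) for tok in text.split())


print(normalize_tokens("nk tnya brp sem utk grad"))
# nak tanya berapa semester untuk graduasi
```

This simple version leaves unknown tokens untouched and does not split punctuation from words, so a token like `utk,` would not be matched.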

NER Dataset

2,250+ annotated sentences with 11 entity types:

Entity Type	Count	Example
PERSON	450+	Ali, Dr Siti, PM Anwar
ORGANIZATION	320+	UMP, Petronas, MARA
LOCATION	280+	Kuala Lumpur, Penang, Kuantan
PRODUCT	150+	iPhone, Proton X70
EVENT	120+	Hari Raya, Merdeka
DATE	200+	semalam, Isnin, 2026
TIME	80+	pagi, pukul 3
MONEY	60+	RM50, seratus ringgit
PHONE	40+	012-3456789
EMAIL	30+	ali@email.com
PERCENT	20+	50%, lima puluh peratus
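Several of the entity types above (MONEY, PHONE, EMAIL, PERCENT) have a regular surface form in their numeric variants, so a regex baseline is an easy sanity check against the annotations. The patterns below are assumptions written to match the examples in the table. They are not the package's NER method and do not cover spelled-out forms such as "seratus ringgit" or "lima puluh peratus".

```python
import re

# Illustrative patterns derived from the table's examples (assumptions).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b01\d-\d{7,8}\b"),
    "MONEY": re.compile(r"\bRM\s?\d+(?:\.\d{2})?\b"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def find_pattern_entities(text: str) -> list:
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    # Sort by start offset so results follow the text order.
    return [(label, value) for _, label, value in sorted(hits)]


print(find_pattern_entities("Hubungi 012-3456789 atau ali@email.com, harga RM50, diskaun 50%"))
# [('PHONE', '012-3456789'), ('EMAIL', 'ali@email.com'), ('MONEY', 'RM50'), ('PERCENT', '50%')]
```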

Translation Pairs

1,000+ Malay-English (BM-EN) word and phrase pairs for rule-based translation.

from malaysian_manglish_nlp import translate, to_english, to_malay

translate("Apa khabar?", source="ms", target="en")
# "How are you?"

to_english("Saya nak pergi kedai")
# "I want to go to the shop"

Training Your Own Models

# Using the finetune module
from malaysian_manglish_nlp.transformers.finetune import train

results = train(
    data_path="datasets/manglish_14384.jsonl",
    output_dir="my_model/",
    epochs=5,
    batch_size=16,
)
print(f"Best accuracy: {results['best_val_accuracy']:.4f}")
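The call above reports a validation accuracy, but this page does not say how `train` splits the data internally. If a separate held-out set is wanted before training, one option is a stratified split that keeps the label shares similar on both sides. The sketch below is a generic standard-library version; `stratified_split`, `val_fraction`, and `seed` are hypothetical names and not part of the package.

```python
import random
from collections import defaultdict


def stratified_split(rows, field="sentiment", val_fraction=0.1, seed=42):
    """Split dict rows into (train, held_out), sampling each label separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[field]].append(row)

    train, held_out = [], []
    for label in sorted(by_label):  # sorted for a reproducible order
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out at least one example per label.
        n_held = max(1, round(len(group) * val_fraction))
        held_out.extend(group[:n_held])
        train.extend(group[n_held:])
    return train, held_out


# Toy rows for illustration only.
rows = [{"sentiment": "positive"}] * 20 + [{"sentiment": "negative"}] * 10
train_rows, held_out_rows = stratified_split(rows, val_fraction=0.2)
print(len(train_rows), len(held_out_rows))
# 24 6
```

Note that a label with a single example ends up entirely in the held-out set, so very rare classes need separate handling.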

See Fine-tuned Models for full training details.

License

All datasets are released under the MIT License.