Pretrained Models

malaysian-manglish-nlp ships with pretrained models trained on real Malaysian social media data.


Word Embeddings

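| Model             | Method             | Dimensions | Vocab | Training Data |
|-------------------|--------------------|------------|-------|---------------|
| manglish-word2vec | Word2Vec CBOW      | 100        | 518   | 50k+ tweets   |
| manglish-fasttext | FastText skip-gram | 100        | 518   | 50k+ tweets   |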

Loading word embeddings

from malaysian_manglish_nlp import word_embeddings

# Load Word2Vec
w2v = word_embeddings.load_word2vec()
w2v.most_similar("makan")
# [('nasi', 0.82), ('roti', 0.79), ('minum', 0.74), ...]

# Load FastText
ft = word_embeddings.load_fasttext()
ft.most_similar("best")
# [('gempak', 0.78), ('power', 0.76), ('padu', 0.73), ...]
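
The scores returned by most_similar look like cosine similarities between word vectors, which is how gensim-style embedding models usually rank neighbours. Whether the loaders above return such objects is an assumption. The sketch below shows the ranking in pure Python with made-up 3-dimensional vectors. The names cosine_similarity, toy_vectors and this most_similar are illustrative and not part of the package.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; the real embeddings have 100 dimensions.
toy_vectors = {
    "makan": [0.9, 0.1, 0.0],
    "nasi": [0.8, 0.2, 0.1],
    "kereta": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("makan", toy_vectors)
```

In this toy setup "nasi" ranks above "kereta" for the query "makan" because its vector points in nearly the same direction.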

Fine-tuned Multi-Task Model (v3.3.0)

An XLM-RoBERTa model (xlm-roberta-base) fine-tuned on 28,263 labeled Manglish examples for three classification tasks: sentiment, emotion, and intent.

| Task      | Accuracy | Classes |
|-----------|----------|---------|
| Sentiment | 98.0%    | positive, negative, neutral |
| Emotion   | 96.5%    | happy, sad, angry, fear, surprise, disgust, love, neutral |
| Intent    | 99.3%    | question, statement, request, complaint, greeting, opinion, command, offer |
| Average   | 97.9%    |         |

Usage

from malaysian_manglish_nlp.transformers.manglish_model import (
    load_model,
    predict,
    predict_batch,  # assumed to be exported by the same module as predict
)

# Load model (auto-downloads from HuggingFace on first use)
model = load_model()

# Predict on a single text
result = predict("gila best servis ni")
# {'sentiment': {'label': 'positive', 'confidence': 0.96},
#  'emotion':    {'label': 'happy',    'confidence': 0.85},
#  'intent':     {'label': 'opinion',  'confidence': 1.00}}

# Batch prediction
results = predict_batch(["best gila", "teruk la", "ok je"])
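
Each task in the prediction comes with its own confidence, so a caller may want to keep only the labels that clear a threshold. A minimal sketch that works on the dictionary shape shown above: the helper name confident_labels and the 0.9 threshold are hypothetical, and the example dictionary is copied from the sample output rather than produced by the model.

```python
def confident_labels(result, threshold=0.9):
    """Keep only the tasks whose confidence meets the threshold."""
    return {
        task: output["label"]
        for task, output in result.items()
        if output["confidence"] >= threshold
    }

# Sample output copied from the usage example; no model is loaded here.
example = {
    "sentiment": {"label": "positive", "confidence": 0.96},
    "emotion": {"label": "happy", "confidence": 0.85},
    "intent": {"label": "opinion", "confidence": 1.00},
}

accepted = confident_labels(example)
```

With the default threshold the emotion label is dropped because its confidence of 0.85 falls below 0.9, while sentiment and intent are kept.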

Model details

  • Base: xlm-roberta-base
  • Architecture: Shared encoder + 3 task-specific heads (256 hidden units each)
  • Fine-tuned on: 22,610 train / 5,653 validation (from 28,263 total)
  • Training: 5 epochs, lr=2e-5 (encoder) / 2e-4 (heads), batch_size=16 with 2-step gradient accumulation (effective batch size 32)
  • Optimizer: AdamW, with a cosine annealing learning-rate schedule with warm restarts
  • Loss and regularization: focal loss for class imbalance, uncertainty-weighted multi-task loss, early stopping
  • Hardware: NVIDIA RTX 2070 8GB VRAM, mixed precision (FP16)
  • Model size: ~1.1GB (PyTorch state dict)
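
The two loss terms named in the list can be illustrated with scalar arithmetic. This is a sketch under assumptions, not the training code: gamma=2.0 is a common default that the source does not state, the uncertainty weighting follows the widely used form with one learnable log-variance per task, and plain Python floats stand in for tensors. The function names are illustrative.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example, given the probability assigned to the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def uncertainty_weighted_total(task_losses, log_vars):
    """Combine task losses as sum(exp(-s_i) * L_i + s_i), with s_i a per-task log-variance."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

easy = focal_loss(0.95)              # well-classified example, heavily down-weighted
hard = focal_loss(0.30)              # poorly classified example, down-weighted far less
cross_entropy_easy = -math.log(0.95)  # plain cross-entropy for comparison

# With every log-variance at 0 the total is just the sum of the task losses.
total = uncertainty_weighted_total([0.4, 0.6, 0.2], [0.0, 0.0, 0.0])

# Raising a task's log-variance shrinks that task's contribution to the total.
damped = uncertainty_weighted_total([0.4, 0.6, 0.2], [0.0, 1.0, 0.0])
```

The focal term reduces the influence of examples the model already classifies well, which is why it helps with class imbalance. The s_i terms let training trade off the sentiment, emotion, and intent losses instead of fixing their weights by hand.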

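Training history

Per-epoch losses and validation accuracy. The epoch 5 figures match the v3.1.0 results in the version comparison (88.5% / 83.6% / 94.5%, average 88.9%), not the v3.3.0 results reported above, so this table appears to describe the v3.1.0 run.

| Epoch | Train Loss | Val Loss | Sentiment | Emotion | Intent | Avg Acc |
|-------|------------|----------|-----------|---------|--------|---------|
| 1     | 1.258      | 0.928    | 63.0%     | 50.6%   | 75.6%  | 63.1%   |
| 2     | 0.743      | 0.599    | 78.9%     | 71.1%   | 87.9%  | 79.3%   |
| 3     | 0.462      | 0.441    | 86.1%     | 78.2%   | 92.1%  | 85.5%   |
| 4     | 0.316      | 0.390    | 87.4%     | 82.8%   | 94.4%  | 88.2%   |
| 5     | 0.243      | 0.375    | 88.5%     | 83.6%   | 94.5%  | 88.9%   |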
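
The loss columns in the table above can be used to check when validation loss stops improving, which is what early stopping monitors. A small sketch over the tabulated values: the helpers best_epoch and should_stop, and the patience of 2, are illustrative choices and not the project's actual early-stopping settings.

```python
# (epoch, train_loss, val_loss) copied from the training history table
history = [
    (1, 1.258, 0.928),
    (2, 0.743, 0.599),
    (3, 0.462, 0.441),
    (4, 0.316, 0.390),
    (5, 0.243, 0.375),
]

def best_epoch(rows):
    """Epoch with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])[0]

def should_stop(val_losses, patience=2, min_delta=0.0):
    """True if none of the last `patience` epochs improved on the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before - min_delta for loss in val_losses[-patience:])

val_losses = [row[2] for row in history]

# Validation loss minus training loss; a widening gap hints at overfitting.
gaps = {epoch: round(val - train, 3) for epoch, train, val in history}
```

On these numbers validation loss is still falling at epoch 5, so a patience-based rule would not have stopped the run early, although the gap between validation and training loss widens from epoch 3 onward.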

v3.2.0 (XLM-RoBERTa, 14,384 examples): Sentiment 95.0%, Emotion 90.3%, Intent 97.5%, Avg 94.3%

v3.3.0 (XLM-RoBERTa, 28,263 examples): Sentiment 98.0%, Emotion 96.5%, Intent 99.3%, Avg 97.9%

v3.3.0 retraining notes: 4,801 partial-label samples were filtered out because they raised a KeyError during multi-task training. Contrast-marker-aware window scoring was added for aspect sentiment, and the code was migrated to Pydantic v2 ConfigDict.

Download from HuggingFace

# Auto-download (built into load_model())
from malaysian_manglish_nlp.transformers.manglish_model import load_model
model = load_model()

# Or manual download
from huggingface_hub import hf_hub_download
hf_hub_download("vexccz/manglish-nlp-sentiment", "model.pt")
hf_hub_download("vexccz/manglish-nlp-sentiment", "config.json")
hf_hub_download("vexccz/manglish-nlp-sentiment", "tokenizer.json")
hf_hub_download("vexccz/manglish-nlp-sentiment", "tokenizer_config.json")
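
The four manual calls can be folded into one loop. In this sketch the download function is passed in as a parameter so the helper can be exercised without network access. The helper name download_all is hypothetical, while the repository ID and filenames are the ones listed above.

```python
from pathlib import Path

REPO_ID = "vexccz/manglish-nlp-sentiment"
FILENAMES = ["model.pt", "config.json", "tokenizer.json", "tokenizer_config.json"]

def download_all(download, repo_id=REPO_ID, filenames=FILENAMES):
    """Call download(repo_id, filename) for each file; map filename -> local path."""
    return {name: Path(download(repo_id, name)) for name in filenames}

# Real use (needs network access and the huggingface_hub package):
#   from huggingface_hub import hf_hub_download
#   paths = download_all(hf_hub_download)
#   print(paths["model.pt"])
```

hf_hub_download returns the local path of the cached file, so the resulting dictionary tells you where each artifact ended up on disk.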

Rule-based fallback

If the fine-tuned model is not available (for example, the package was installed without the [transformers] extra), use the built-in rule-based modules:

from malaysian_manglish_nlp import sentiment, detect_emotion, classify_intent

# Rule-based sentiment (no model needed)
result = sentiment("Best lah movie ni, memang power!")
# {'sentiment': 'positive', 'score': 0.94}

# Rule-based emotion
emotion = detect_emotion("sedih doh tak dapat tiket")
# {'emotion': 'sad', 'confidence': 0.82}
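
An application can decide at startup which path to take by checking whether the deep-learning dependencies are importable. The sketch below uses only the standard library. It assumes that the [transformers] extra installs the torch and transformers packages, which the source does not confirm, and the names has_module and choose_backend are hypothetical.

```python
import importlib.util

def has_module(name):
    """True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None

def choose_backend(required=("torch", "transformers")):
    """Return 'finetuned' when every required module is present, else 'rule_based'."""
    if all(has_module(module) for module in required):
        return "finetuned"
    return "rule_based"

backend = choose_backend()
```

A caller could then route to predict when backend is "finetuned" and to the rule-based sentiment and detect_emotion functions otherwise.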

Comparison with previous versions

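| Metric     | v3.0.0 (561 examples) | v3.1.0 (7,884 examples) | v3.2.0 (14,384 examples) | v3.3.0 (28,263 examples) |
|------------|-----------------------|-------------------------|--------------------------|--------------------------|
| Sentiment  | 69%                   | 88.5%                   | 95.0%                    | 98.0%                    |
| Emotion    | 63%                   | 83.6%                   | 90.3%                    | 96.5%                    |
| Intent     | 69%                   | 94.5%                   | 97.5%                    | 99.3%                    |
| Average    | 67%                   | 88.9%                   | 94.3%                    | 97.9%                    |
| Base model | DistilBERT            | DistilBERT              | XLM-RoBERTa              | XLM-RoBERTa              |
| Tasks      | Single (sentiment)    | Multi-task (3 tasks)    | Multi-task (3 tasks)     | Multi-task (3 tasks)     |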
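
The Average row is the unweighted mean of the three task accuracies, and the gains between versions are easier to judge as relative error reduction. A short sketch over the figures from the comparison above; the helper names average and error_reduction are illustrative.

```python
# Per-task accuracy (%) for each release, copied from the version comparison
accuracy = {
    "v3.0.0": {"sentiment": 69.0, "emotion": 63.0, "intent": 69.0},
    "v3.1.0": {"sentiment": 88.5, "emotion": 83.6, "intent": 94.5},
    "v3.2.0": {"sentiment": 95.0, "emotion": 90.3, "intent": 97.5},
    "v3.3.0": {"sentiment": 98.0, "emotion": 96.5, "intent": 99.3},
}

def average(scores):
    """Unweighted mean over the tasks, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)

def error_reduction(old, new):
    """Relative drop in error rate (100 - accuracy) from old to new, in percent."""
    old_error = 100.0 - old
    new_error = 100.0 - new
    return round(100.0 * (old_error - new_error) / old_error, 1)

averages = {version: average(scores) for version, scores in accuracy.items()}
emotion_gain = error_reduction(
    accuracy["v3.2.0"]["emotion"], accuracy["v3.3.0"]["emotion"]
)
```

The computed averages reproduce the Average row (67.0, 88.9, 94.3, 97.9). Between v3.2.0 and v3.3.0 the emotion error rate falls from 9.7% to 3.5%, a relative reduction of roughly 64%.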

Comparison with other models

| Model                               | Accuracy | Notes |
|-------------------------------------|----------|-------|
| manglish-finetuned v3.3.0           | 97.9%    | XLM-RoBERTa multi-task, best for Manglish |
| manglish-finetuned v3.2.0           | 94.3%    | XLM-RoBERTa multi-task |
| manglish-finetuned v3.1.0           | 88.9%    | DistilBERT multi-task |
| Mesolitica NanoT5 (tiny)            | 86.1%    | Malay-only base |
| huseinzol05 sentiment               | 84.7%    | Broader Malay coverage |
| DistilBERT multilingual (zero-shot) | 62.3%    | No fine-tuning |

Model Storage

Model files are stored locally in the layout below; the fine-tuned model is downloaded from HuggingFace on first use:

~/.agents/skills/manglish-nlp/malaysian_manglish_nlp/resources/
├── manglish_finetuned/
│   ├── model.pt          # 1.1GB
│   ├── config.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
└── word_embeddings/
    ├── word2vec.model
    └── fasttext.model
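
To confirm that a local installation has every file from the layout above, the expected paths can be checked with pathlib. This is a sketch: the helper name missing_files is hypothetical, and the resources directory is passed in by the caller rather than discovered from the package.

```python
from pathlib import Path

# Relative paths taken from the storage layout
EXPECTED = [
    "manglish_finetuned/model.pt",
    "manglish_finetuned/config.json",
    "manglish_finetuned/tokenizer.json",
    "manglish_finetuned/tokenizer_config.json",
    "word_embeddings/word2vec.model",
    "word_embeddings/fasttext.model",
]

def missing_files(resources_dir):
    """Return the expected relative paths that are not present under resources_dir."""
    root = Path(resources_dir).expanduser()
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

# Example call against the directory shown in the layout:
#   missing_files("~/.agents/skills/manglish-nlp/malaysian_manglish_nlp/resources")
```

An empty list means all six files are in place. Anything listed under manglish_finetuned/ can be fetched again with load_model() or the manual download calls.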

Citation

@software{malaysian_manglish_nlp_2026,
  title  = {malaysian-manglish-nlp: A Comprehensive NLP Toolkit for Malaysian Manglish},
  author = {Yusof, Zafran},
  year   = {2026},
  version = {3.3.0},
  url    = {https://github.com/ZafranYusof/malaysia-manglish-nlp}
}