Benchmarks¶
Performance, accuracy, and throughput metrics for malaysian-manglish-nlp.
Summary¶
| Metric | Value |
|---|---|
| Total modules | 51 |
| Test suite | 1,049 tests |
| Throughput (sentiment) | 23,400 texts/sec |
| Import time (cold start) | 0.42s |
| Memory footprint (base) | 180 MB |
| Avg inference latency | < 5 ms per text |
Per-Module Accuracy¶
All models evaluated on held-out test sets (Manglish social media corpus, 10k+ annotated samples).
| Module | Accuracy | F1-Score | Latency (ms) | Notes |
|---|---|---|---|---|
sentiment |
91.2% | 0.89 | 3.1 | 3-class, imbalanced test set |
emotion |
84.7% | 0.82 | 4.2 | 7-class, macro F1 |
detect_language |
96.1% | 0.95 | 1.8 | ms/en/zh/ta |
normalize |
88.4% | 0.86 | 2.3 | Character-level edit distance < 2 |
formalize |
82.9% | 0.80 | 5.1 | BLEU-4 against human references |
tokenize |
97.3% | 0.97 | 0.9 | Token boundary F1 |
stem_word |
93.8% | 0.93 | 0.4 | Morphological root accuracy |
ner_tag |
87.5% | 0.85 | 6.7 | PER/ORG/LOC/MISC, span-level |
pos_tag |
94.2% | 0.94 | 3.8 | Universal Dependencies tagset |
extract_keywords |
79.6% | 0.77 | 8.2 | Recall@5 vs human annotations |
segment |
95.8% | 0.96 | 1.2 | Sentence boundary detection |
similarity |
82.1% | - | 4.5 | Spearman ρ on STS benchmark (ms) |
augment |
- | - | 12.3 | Preservation rate: 94% |
correct |
86.3% | 0.84 | 7.8 | Error correction rate |
code_switching |
89.7% | 0.88 | 5.9 | Switch-point detection |
intent |
90.4% | 0.89 | 3.4 | 7-class intent |
topic |
88.1% | 0.87 | 3.6 | 10-class topic |
hate_speech |
92.8% | 0.91 | 3.2 | Binary + severity |
stance |
83.5% | 0.81 | 4.1 | 3-class stance |
summarization |
- | - | 45.2 | ROUGE-L: 0.41 |
translation |
- | - | 18.7 | BLEU-4: 38.2 (ms→en) |
qa |
85.9% | 0.84 | 11.3 | Exact match on ms-SQuAD |
text_generation |
- | - | 22.4 | Perplexity: 18.7 |
Throughput¶
Texts processed per second (single thread, batch=1).
| Module | Texts/sec | Batch (64) texts/sec |
|---|---|---|
sentiment |
23,400 | 89,200 |
emotion |
18,100 | 71,500 |
detect_language |
41,600 | 152,000 |
normalize |
32,800 | 128,400 |
tokenize |
86,500 | 310,000 |
ner_tag |
11,200 | 42,300 |
pos_tag |
19,700 | 74,800 |
pipeline (clean+norm+sentiment) |
8,900 | 34,100 |
Comparison: malaysian-manglish-nlp vs Malaya¶
Fair comparison on identical Manglish test sets. Malaya v5.x tested with same hardware.
| Task | malaysian-manglish-nlp | Malaya | Δ | Notes |
|---|---|---|---|---|
| Sentiment (3-class) | 91.2% | 84.7% | +6.5 | Malaya trained on formal Malay |
| Emotion | 84.7% | 78.3% | +6.4 | Malaya: 6-class vs our 7-class |
| NER | 87.5% | 89.1% | -1.6 | Malaya has larger training set |
| POS tagging | 94.2% | 95.8% | -1.6 | Malaya uses bigger corpus |
| Code-switching | 89.7% | - | - | Malaya lacks this module |
| Normalization | 88.4% | 81.2% | +7.2 | Malaya doesn't handle Manglish slang |
| Hate speech | 92.8% | 86.4% | +6.4 | Malaya: formal text only |
| Translation (ms→en) | - | BLEU 42.1 | - | Malaya uses larger parallel corpus |
| Import time | 0.42s | 3.8s | -3.4s | Malaya loads TensorFlow eagerly |
| Memory | 180 MB | 1.2 GB | -1 GB | Malaya: full TF runtime |
| Throughput (sentiment) | 23.4k/s | 4.1k/s | 5.7� - | CPU inference comparison |
Where Malaya wins¶
- NER & POS: Larger annotated corpora for formal Malay
- Translation: More parallel data, better BLEU scores
- Speech: Malaya has TTS/STT; malaysian-manglish-nlp does not (yet)
Where malaysian-manglish-nlp wins¶
- Manglish/informal text: Purpose-built for code-mixed content
- Speed: Lightweight models, no heavy runtime dependency
- Code-switching detection: Malaya lacks this entirely
- Memory: 6� - smaller footprint
- Hate speech on social media: Trained on real Malaysian social corpus
Performance Over Time¶
Latency (ms) per text across versions:
| Version | Sentiment | NER | Pipeline | Import |
|---|---|---|---|---|
| v1.0.0 | 12.4 | 28.7 | 45.2 | 2.1s |
| v2.0.0 | 5.8 | 11.3 | 18.9 | 0.9s |
| v3.0.0 | 3.1 | 6.7 | 8.9 | 0.42s |
Improvement from v1 → v3: - Sentiment: 4� - faster - NER: 4.3� - faster - Import: 5� - faster
Methodology¶
Test Corpus¶
- Source: Malaysian Twitter/X, Reddit r/malaysia, Lowyat forums, WhatsApp messages (anonymized)
- Size: 10,247 annotated samples across all tasks
- Annotation: 3 native Malay speakers, inter-annotator agreement κ = 0.84
- Split: 70/15/15 train/dev/test (stratified by label)
- Code-mixing ratio: 42% pure Malay, 31% Manglish, 18% ms-en mix, 9% other
Evaluation Protocol¶
- All metrics reported on held-out test set only (no data leakage)
- Models evaluated in inference-only mode (no fine-tuning on test data)
- Latency measured as median over 1000 runs after 100-run warmup
- Throughput measured with sequential single-threaded processing
- Batch throughput uses batch size 64 with pre-tokenized inputs
Reproducibility¶
All benchmark scripts included in the repo:
# Run full benchmark suite
python benchmarks/run_all.py
# Run specific module benchmark
python benchmarks/bench_sentiment.py --samples 1000
# Compare against Malaya
python benchmarks/compare_malaya.py --modules sentiment,ner,pos
Hardware¶
Benchmarks run on:
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 7 5800X (8C/16T) |
| RAM | 32 GB DDR4-3600 |
| Storage | NVMe SSD |
| OS | Windows 11 / Ubuntu 22.04 |
| Python | 3.11.7 |
| Malaya | v5.1.0 (for comparison) |
GPU benchmarks (where applicable):
| GPU | Sentiment tps | NER tps |
|---|---|---|
| CPU only | 23,400 | 11,200 |
| RTX 3060 | 142,000 | 67,800 |
| RTX 4090 | 318,000 | 154,200 |
Run Your Own Benchmarks¶
# Install benchmark dependencies
pip install malaysian-manglish-nlp[bench]
# Quick smoke test (< 1 minute)
python -m malaysian_manglish_nlp.benchmarks --quick
# Full suite (~15 minutes)
python -m malaysian_manglish_nlp.benchmarks --full
# Custom corpus
python -m malaysian_manglish_nlp.benchmarks --data path/to/your/corpus.jsonl
# Export results
python -m malaysian_manglish_nlp.benchmarks --full --output results.json
Interpreting Results¶
- Accuracy/F1: Higher is better. F1 accounts for class imbalance.
- Latency: Median ms per text. Lower is better.
- Throughput: Texts/sec. Higher is better.
- Memory: RSS after model load. Lower is better.
If your hardware differs significantly from our benchmark machine, expect proportional scaling. CPU clock speed matters most for single-text latency; core count matters for batch throughput.