Benchmarks¶

Performance, accuracy, and throughput metrics for malaysian-manglish-nlp.

Summary¶

Metric	Value
Total modules	51
Test suite	1,049 tests
Throughput (sentiment)	23,400 texts/sec
Import time (cold start)	0.42s
Memory footprint (base)	180 MB
Avg inference latency	< 5 ms per text

Per-Module Accuracy¶

All models evaluated on held-out test sets (Manglish social media corpus, 10k+ annotated samples).

Module	Accuracy	F1-Score	Latency (ms)	Notes
`sentiment`	91.2%	0.89	3.1	3-class, imbalanced test set
`emotion`	84.7%	0.82	4.2	7-class, macro F1
`detect_language`	96.1%	0.95	1.8	ms/en/zh/ta
`normalize`	88.4%	0.86	2.3	Character-level edit distance < 2
`formalize`	82.9%	0.80	5.1	BLEU-4 against human references
`tokenize`	97.3%	0.97	0.9	Token boundary F1
`stem_word`	93.8%	0.93	0.4	Morphological root accuracy
`ner_tag`	87.5%	0.85	6.7	PER/ORG/LOC/MISC, span-level
`pos_tag`	94.2%	0.94	3.8	Universal Dependencies tagset
`extract_keywords`	79.6%	0.77	8.2	Recall@5 vs human annotations
`segment`	95.8%	0.96	1.2	Sentence boundary detection
`similarity`	82.1%	-	4.5	Spearman ρ on STS benchmark (ms)
`augment`	-	-	12.3	Preservation rate: 94%
`correct`	86.3%	0.84	7.8	Error correction rate
`code_switching`	89.7%	0.88	5.9	Switch-point detection
`intent`	90.4%	0.89	3.4	7-class intent
`topic`	88.1%	0.87	3.6	10-class topic
`hate_speech`	92.8%	0.91	3.2	Binary + severity
`stance`	83.5%	0.81	4.1	3-class stance
`summarization`	-	-	45.2	ROUGE-L: 0.41
`translation`	-	-	18.7	BLEU-4: 38.2 (ms→en)
`qa`	85.9%	0.84	11.3	Exact match on ms-SQuAD
`text_generation`	-	-	22.4	Perplexity: 18.7

Throughput¶

Texts processed per second (single thread, batch=1).

Module	Texts/sec	Batch (64) texts/sec
`sentiment`	23,400	89,200
`emotion`	18,100	71,500
`detect_language`	41,600	152,000
`normalize`	32,800	128,400
`tokenize`	86,500	310,000
`ner_tag`	11,200	42,300
`pos_tag`	19,700	74,800
`pipeline (clean+norm+sentiment)`	8,900	34,100

Comparison: malaysian-manglish-nlp vs Malaya¶

Fair comparison on identical Manglish test sets. Malaya v5.x tested with same hardware.

Task	malaysian-manglish-nlp	Malaya	Δ	Notes
Sentiment (3-class)	91.2%	84.7%	+6.5	Malaya trained on formal Malay
Emotion	84.7%	78.3%	+6.4	Malaya: 6-class vs our 7-class
NER	87.5%	89.1%	-1.6	Malaya has larger training set
POS tagging	94.2%	95.8%	-1.6	Malaya uses bigger corpus
Code-switching	89.7%	-	-	Malaya lacks this module
Normalization	88.4%	81.2%	+7.2	Malaya doesn't handle Manglish slang
Hate speech	92.8%	86.4%	+6.4	Malaya: formal text only
Translation (ms→en)	-	BLEU 42.1	-	Malaya uses larger parallel corpus
Import time	0.42s	3.8s	-3.4s	Malaya loads TensorFlow eagerly
Memory	180 MB	1.2 GB	-1 GB	Malaya: full TF runtime
Throughput (sentiment)	23.4k/s	4.1k/s	5.7� -	CPU inference comparison

Where Malaya wins¶

NER & POS: Larger annotated corpora for formal Malay
Translation: More parallel data, better BLEU scores
Speech: Malaya has TTS/STT; malaysian-manglish-nlp does not (yet)

Where malaysian-manglish-nlp wins¶

Manglish/informal text: Purpose-built for code-mixed content
Speed: Lightweight models, no heavy runtime dependency
Code-switching detection: Malaya lacks this entirely
Memory: 6� - smaller footprint
Hate speech on social media: Trained on real Malaysian social corpus

Performance Over Time¶

Latency (ms) per text across versions:

Version	Sentiment	NER	Pipeline	Import
v1.0.0	12.4	28.7	45.2	2.1s
v2.0.0	5.8	11.3	18.9	0.9s
v3.0.0	3.1	6.7	8.9	0.42s

Improvement from v1 → v3: - Sentiment: 4� - faster - NER: 4.3� - faster - Import: 5� - faster

Methodology¶

Test Corpus¶

Source: Malaysian Twitter/X, Reddit r/malaysia, Lowyat forums, WhatsApp messages (anonymized)
Size: 10,247 annotated samples across all tasks
Annotation: 3 native Malay speakers, inter-annotator agreement κ = 0.84
Split: 70/15/15 train/dev/test (stratified by label)
Code-mixing ratio: 42% pure Malay, 31% Manglish, 18% ms-en mix, 9% other

Evaluation Protocol¶

All metrics reported on held-out test set only (no data leakage)
Models evaluated in inference-only mode (no fine-tuning on test data)
Latency measured as median over 1000 runs after 100-run warmup
Throughput measured with sequential single-threaded processing
Batch throughput uses batch size 64 with pre-tokenized inputs

Reproducibility¶

All benchmark scripts included in the repo:

# Run full benchmark suite
python benchmarks/run_all.py

# Run specific module benchmark
python benchmarks/bench_sentiment.py --samples 1000

# Compare against Malaya
python benchmarks/compare_malaya.py --modules sentiment,ner,pos

Hardware¶

Benchmarks run on:

Component	Spec
CPU	AMD Ryzen 7 5800X (8C/16T)
RAM	32 GB DDR4-3600
Storage	NVMe SSD
OS	Windows 11 / Ubuntu 22.04
Python	3.11.7
Malaya	v5.1.0 (for comparison)

GPU benchmarks (where applicable):

GPU	Sentiment tps	NER tps
CPU only	23,400	11,200
RTX 3060	142,000	67,800
RTX 4090	318,000	154,200

Run Your Own Benchmarks¶

# Install benchmark dependencies
pip install malaysian-manglish-nlp[bench]

# Quick smoke test (< 1 minute)
python -m malaysian_manglish_nlp.benchmarks --quick

# Full suite (~15 minutes)
python -m malaysian_manglish_nlp.benchmarks --full

# Custom corpus
python -m malaysian_manglish_nlp.benchmarks --data path/to/your/corpus.jsonl

# Export results
python -m malaysian_manglish_nlp.benchmarks --full --output results.json

Interpreting Results¶

Accuracy/F1: Higher is better. F1 accounts for class imbalance.
Latency: Median ms per text. Lower is better.
Throughput: Texts/sec. Higher is better.
Memory: RSS after model load. Lower is better.

If your hardware differs significantly from our benchmark machine, expect proportional scaling. CPU clock speed matters most for single-text latency; core count matters for batch throughput.