Acknowledgement

malaysian-manglish-nlp stands on the shoulders of many excellent open-source projects and the broader Malaysian NLP community.

Tools & Libraries

Project	Role
spaCy	NLP pipeline architecture inspiration
Malaya	Pioneering Malaysian NLP toolkit and primary design reference
HuggingFace Transformers	XLM-RoBERTa model and fine-tuning framework
HuggingFace Datasets	Dataset loading and management
Gensim	Word2Vec and FastText training
NetworkX	Graph algorithms for text graph analysis
NumPy	Numerical computation backbone
pandas	Dataset manipulation
FastAPI	REST API server
pytest	Testing framework

Data Sources

Source	Usage
Twitter/X Malaysian users	Sentiment training data
Lowyat forum posts	Sentiment and slang data
Reddit r/malaysia	Sentiment data
Malaysian news portals	NER annotation source
DBP (Dewan Bahasa dan Pustaka)	Dictionary and standard Malay reference
Bernama	News text for NER
Berita Harian	News text for NER

Research & Models

Resource	Contribution
Mesolitica NanoT5	Baseline Malay sentiment model for comparison
huseinzol05/malaysian-sentiment	Sentiment analysis benchmark
XLM-RoBERTa base	Base model for fine-tuned classifier (v3.2.0)
DistilBERT multilingual	Previous base model (v3.0.0-v3.1.0)
CoNLL-2003 NER format	Annotation standard for NER dataset

Academic Support

UMP (Universiti Malaysia Pahang) - Academic institution and Final Year Project supervision
Faculty of Computing, UMP - Infrastructure and guidance

Community

Malaysian NLP community - feedback, dataset contributions, and bug reports
Open source contributors on GitHub - pull requests, issues, and suggestions
r/malaysia and Lowyat community members whose public posts (anonymised) form part of the training data

Inspiration

malaysian-manglish-nlp was heavily inspired by Malaya, created by Husein Zolkepli. Malaya pioneered the Malaysian NLP toolkit space and set the standard for API design, documentation quality, and model coverage. This project aims to complement Malaya by focusing specifically on the Manglish code-switching register.

References

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of ACL 2020, 8440–8451.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of CoNLL-2003, 142–147.