Module Reference
51 production-ready NLP modules for Malaysian text - zero configuration, one import.
Overview
malaysian-manglish-nlp organises its modules into eight functional groups. Every module follows the same call signature: mnlp.<module>(text, **options). Core modules run with zero external dependencies; the advanced and generation modules need the optional [ml] extra.
pip install malaysian-manglish-nlp          # core only (text processing, analysis, extraction)
pip install "malaysian-manglish-nlp[ml]"    # + transformer models (generation, advanced)
pip install "malaysian-manglish-nlp[all]"   # everything, including spaCy & LangChain (quotes keep zsh from globbing the brackets)
Module Grid
Text Processing
normalize - Informal → standard spelling (12k+ shortform mappings)
clean - Strip noise: URLs, mentions, emojis, repeated chars
formalize - Casual Manglish → formal Bahasa Melayu
tokenize - Malaysian-aware tokeniser (word / sentence / subword)
stemmer - Rule-based Malay affix stripping (me-/ber-/di-/-kan/-an/-i)
segment - Split concatenated text, hashtags, URLs
spelling - Context-aware spelling correction
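To make the clean step concrete, here is a minimal sketch of the kind of noise stripping it describes, written with only the standard library. It is not the library's implementation: the function name clean_sketch and all three regular expressions are assumptions chosen for illustration, and emoji removal is left out to keep it short.

```python
import re

# Hypothetical patterns for illustration; the real module may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")          # skip the "@" inside e-mail addresses
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")      # a letter repeated 3+ times

def clean_sketch(text: str) -> str:
    """Strip URLs and mentions, then collapse elongated letters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1", text)            # letters only, so digits are untouched
    return " ".join(text.split())                # normalise whitespace

print(clean_sketch("@ali bestnyaaaa makan sini https://t.co/abc"))
```

Restricting the repeat rule to letters is a deliberate choice here, so that amounts such as RM1000 are not shortened by accident.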
Analysis
sentiment - Positive / negative / neutral with aspect-based option
emotion - 8 emotion labels + intensity scoring
language - Language & dialect detection (BM, EN, Manglish, Kelantan, Kedah…)
profanity - Profanity filter with severity levels & censor modes
sarcasm - Sarcasm / irony detection with cue explanation
Extraction
ner - 7 entity types including Malaysian names, places, currency
pos - UD-based POS tagging adapted for Malay grammar
dependency - Dependency parsing with tree visualisation
coreference - Pronoun & mention resolution
keywords - TF-IDF / TextRank / YAKE keyword extraction
Advanced
code_switching - Detect language switch points & patterns
intent - 8 intent categories + slot filling for chatbots
topic - Topic classification & unsupervised topic modelling
hate_speech - 3 severity levels across 6 target categories
stance - Support / oppose / neutral stance detection
discourse - Rhetorical relation parsing (cause, contrast, concession…)
coreference - Cross-sentence entity linking
Generation
translation - BM ↔ EN ↔ Manglish with entity preservation
summarization - Extractive & abstractive summaries
text_generation - Controlled text generation (style, format, temperature)
qa - Extractive & generative QA with conversational sessions
Data & Embeddings
word_embeddings - 300-dim Word2Vec trained on 10M+ Malaysian texts
embeddings - 768-dim sentence/document embeddings (fast & accurate modes)
similarity - Cosine / Jaccard / WMD semantic similarity
augmentation - 6 augmentation strategies for Malaysian text
dictionary - Lexical resource with definitions, slang, frequency data
spelling - Context-aware spelling correction with informal preservation
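For readers unfamiliar with two of the measures named under similarity, the sketch below computes Jaccard and cosine similarity over whitespace tokens. It is an illustration only: the library presumably works on embeddings or its own tokeniser, and the function names jaccard and cosine are assumptions, not its API.

```python
import math
from collections import Counter

def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0                     # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)

def cosine(a: str, b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(jaccard("saya suka makan", "saya suka minum"))
print(cosine("saya suka makan", "saya suka minum"))
```

Jaccard ignores how often a token occurs, while cosine weights repeated tokens, which is why the two scores usually differ on the same pair of texts.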
Tools & Utilities
ocr_normalize - Fix OCR artefacts in Malay documents
pipeline - Chain modules into reusable, serialisable workflows
calibration - Calibrate confidence scores (Platt / isotonic / temperature)
evaluate - Accuracy, F1, cross-validation, error analysis
hybrid_ml - Rule-first routing with ML fallback
tuning - Grid / random / Bayesian hyperparameter search
profiler - Latency, memory, and throughput benchmarking
cache - Memory / disk / Redis caching with TTL & warm-up
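The idea behind "rule-first routing with ML fallback" can be sketched in a few lines: try a cheap rule, and only call the model when the rule abstains. Everything below is hypothetical (make_hybrid, keyword_rule and dummy_model are illustrative names, and the model is a stub); it shows the routing pattern, not how hybrid_ml is actually built.

```python
from typing import Callable, Optional, Tuple

def make_hybrid(rule: Callable[[str], Optional[str]],
                fallback: Callable[[str], str]) -> Callable[[str], Tuple[str, str]]:
    """Return a classifier that reports the label and which path produced it."""
    def classify(text: str) -> Tuple[str, str]:
        label = rule(text)
        if label is not None:          # the rule was confident enough to answer
            return label, "rule"
        return fallback(text), "ml"    # otherwise defer to the model
    return classify

def keyword_rule(text: str) -> Optional[str]:
    """Toy rule: answer only when an obvious cue word is present."""
    lowered = text.lower()
    if "best" in lowered:
        return "positive"
    if "teruk" in lowered:
        return "negative"
    return None                        # abstain

def dummy_model(text: str) -> str:
    """Stub standing in for a trained model."""
    return "neutral"

classify = make_hybrid(keyword_rule, dummy_model)
print(classify("Makanan ni best gila"))
print(classify("Okay lah"))
```

Returning the route alongside the label makes it easy to measure how often the model is actually needed.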
Integrations
Universal API Pattern
Every module follows the same call convention:
import malaysian_manglish_nlp as mnlp
# Single text
result = mnlp.<module>(text)
# With options
result = mnlp.<module>(text, lang="ms", detailed=True)
# Batch (list input → list output)
results = mnlp.<module>(["text1", "text2", "text3"])
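One way to obtain the "list input → list output" behaviour is a small decorator that maps a single-text function over a list. The sketch below is an assumption about how such a convention could be implemented, not the library's code; batchable and shout are made-up names used only for the demonstration.

```python
from functools import wraps

def batchable(fn):
    """Let a single-text function also accept a list or tuple of texts."""
    @wraps(fn)
    def wrapper(text, **options):
        if isinstance(text, (list, tuple)):
            # Batch: apply the function to each item, preserving order.
            return [fn(item, **options) for item in text]
        return fn(text, **options)
    return wrapper

@batchable
def shout(text: str, suffix: str = "!") -> str:
    """Stand-in for a real module: upper-case the text and append a suffix."""
    return text.upper() + suffix

print(shout("ok"))
print(shout(["a", "b"], suffix="?"))
```

Keyword options are forwarded unchanged to every item, which mirrors the mnlp.<module>(text, **options) shape described above.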
Dependency Tiers
┌──────────────────────────────────────────────────────┐
│ Tier 0 - Core (zero external deps) │
│ normalize, clean, tokenize, stem, segment, spelling │
├──────────────────────────────────────────────────────┤
│ Tier 1 - Analysis (lightweight models) │
│ sentiment, ner, pos, keywords, language, profanity │
├──────────────────────────────────────────────────────┤
│ Tier 2 - Advanced (optional ML) │
│ code_switching, intent, topic, hate_speech, stance │
├──────────────────────────────────────────────────────┤
│ Tier 3 - Generation (requires [ml]) │
│ translate, summarize, generate, qa, embeddings │
├──────────────────────────────────────────────────────┤
│ Tier 4 - Integrations (requires [spacy]/[langchain])│
│ spacy, rest_api, langchain │
└──────────────────────────────────────────────────────┘
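Before calling a higher-tier module it can be useful to check whether the packages behind an extra are importable. The sketch below does this with importlib.util.find_spec from the standard library. The mapping from extras to import names (transformers, spacy, langchain) is an assumption based on the install comments, and both helper functions are hypothetical rather than part of the package.

```python
from importlib.util import find_spec

# Assumed import names for each extra; adjust to the packages actually required.
EXTRA_IMPORTS = {
    "ml": ["transformers"],
    "spacy": ["spacy"],
    "langchain": ["langchain"],
}

def extra_available(extra: str, probe=find_spec) -> bool:
    """True if every package assumed for this extra can be located."""
    return all(probe(name) is not None for name in EXTRA_IMPORTS[extra])

def missing_extras() -> list:
    """List the extras whose packages are not installed in this environment."""
    return [extra for extra in EXTRA_IMPORTS if not extra_available(extra)]

print(missing_extras())
```

find_spec only locates a package without importing it, so the check stays cheap even when heavy ML libraries are present.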
Optional Dependencies
Core modules have zero external dependencies. Install extras only when needed: