Module Reference

51 production-ready NLP modules for Malaysian text - zero configuration, one import.


Overview

malaysian-manglish-nlp organises its modules into eight functional groups. Every module follows the same call convention: mnlp.<module>(text, **options). Core modules run with zero external dependencies; the advanced and generation modules need the optional [ml] extra.

pip install malaysian-manglish-nlp          # core only (text processing, analysis, extraction)
pip install "malaysian-manglish-nlp[ml]"    # + transformer models (generation, advanced)
pip install "malaysian-manglish-nlp[all]"   # everything including spaCy & LangChain

Module Grid

Text Processing

  • normalize - Informal → standard spelling (12k+ shortform mappings)
  • clean - Strip noise: URLs, mentions, emojis, repeated chars
  • formalize - Casual Manglish → formal Bahasa Melayu
  • tokenize - Malaysian-aware tokeniser (word / sentence / subword)
  • stemmer - Rule-based Malay affix stripping (me-/ber-/di-/-kan/-an/-i)
  • segment - Split concatenated text, hashtags, URLs
  • spelling - Context-aware spelling correction
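
To make the stemmer entry concrete, here is a toy sketch of rule-based affix stripping for the affixes listed above. It is not the library's implementation: the function name strip_affixes, the minimum stem length, and the one-prefix-one-suffix policy are assumptions for illustration. It also ignores nasal assimilation, so a form such as menulis would not be mapped back to tulis.

```python
# Toy rule-based Malay affix stripper (illustrative only, not the library's code).
PREFIXES = ("ber", "me", "di")
SUFFIXES = ("kan", "an", "i")
MIN_STEM = 4  # assumed guard so short roots such as "jalan" are not over-stripped


def strip_affixes(word: str) -> str:
    """Strip at most one prefix, then at most one suffix, from a single word."""
    stem = word.lower()

    # Prefix pass: remove the first matching prefix if enough characters remain.
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
            stem = stem[len(prefix):]
            break

    # Suffix pass: longest suffix first, same length guard.
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            stem = stem[: -len(suffix)]
            break

    return stem


if __name__ == "__main__":
    for w in ["dimakan", "makanan", "dibacakan", "berjalan", "melihat"]:
        print(w, "->", strip_affixes(w))
```

A production stemmer would typically add a root-word dictionary and assimilation rules on top of this kind of stripping.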

Analysis

  • sentiment - Positive / negative / neutral with aspect-based option
  • emotion - 8 emotion labels + intensity scoring
  • language - Language & dialect detection (BM, EN, Manglish, Kelantan, Kedah…)
  • profanity - Profanity filter with severity levels & censor modes
  • sarcasm - Sarcasm / irony detection with cue explanation

Extraction

  • ner - 7 entity types including Malaysian names, places, currency
  • pos - UD-based POS tagging adapted for Malay grammar
  • dependency - Dependency parsing with tree visualisation
  • coreference - Pronoun & mention resolution
  • keywords - TF-IDF / TextRank / YAKE keyword extraction
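
As an illustration of the TF-IDF option named in the keywords entry, the following pure-Python sketch ranks the terms of one document against a small corpus. The function name tfidf_keywords, the whitespace tokenisation, and the sample sentences are assumptions for this example; the library's own scoring and tokenisation may differ.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[str], doc_index: int, top_k: int = 3) -> List[str]:
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency within the target document.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


if __name__ == "__main__":
    corpus = [
        "nasi lemak sedap gila",
        "teh tarik sedap",
        "nasi goreng pedas gila",
    ]
    print(tfidf_keywords(corpus, doc_index=1, top_k=2))
```

Terms that appear in every document get a score of zero, which is why common words drop out of the ranking.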

Advanced

  • code_switching - Detect language switch points & patterns
  • intent - 8 intent categories + slot filling for chatbots
  • topic - Topic classification & unsupervised topic modelling
  • hate_speech - 3 severity levels across 6 target categories
  • stance - Support / oppose / neutral stance detection
  • discourse - Rhetorical relation parsing (cause, contrast, concession…)
  • coreference - Cross-sentence entity linking

Generation

  • translation - BM ↔ EN ↔ Manglish with entity preservation
  • summarization - Extractive & abstractive summaries
  • text_generation - Controlled text generation (style, format, temperature)
  • qa - Extractive & generative QA with conversational sessions

Data & Embeddings

  • word_embeddings - 300-dim Word2Vec trained on 10M+ Malaysian texts
  • embeddings - 768-dim sentence/document embeddings (fast & accurate modes)
  • similarity - Cosine / Jaccard / WMD semantic similarity
  • augmentation - 6 augmentation strategies for Malaysian text
  • dictionary - Lexical resource with definitions, slang, frequency data
  • spelling - Context-aware spelling correction with informal preservation
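
Two of the measures named in the similarity entry are simple enough to sketch directly. The helpers below are stand-alone illustrations, not the library's functions: cosine similarity operates on plain lists of floats (for example, embedding vectors), and Jaccard similarity on whitespace-separated token sets. The names and the empty-input conventions are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Shared tokens divided by total distinct tokens across both texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
    print(jaccard_similarity("nasi lemak sedap", "nasi goreng sedap"))
```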

Tools & Utilities

  • ocr_normalize - Fix OCR artefacts in Malay documents
  • pipeline - Chain modules into reusable, serialisable workflows
  • calibration - Calibrate confidence scores (Platt / isotonic / temperature)
  • evaluate - Accuracy, F1, cross-validation, error analysis
  • hybrid_ml - Rule-first routing with ML fallback
  • tuning - Grid / random / Bayesian hyperparameter search
  • profiler - Latency, memory, and throughput benchmarking
  • cache - Memory / disk / Redis caching with TTL & warm-up
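
The "rule-first routing with ML fallback" idea behind hybrid_ml can be sketched in a few lines. This is a hypothetical illustration of the pattern, not the module's actual interface: route, keyword_rule, and stub_model are invented names, and the stub stands in for a real classifier.

```python
from typing import Callable, List, Optional, Tuple

Rule = Callable[[str], Optional[str]]


def route(text: str, rules: List[Rule], fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Try each rule in order; use the fallback model only if no rule fires.

    Returns (label, source), where source is "rule" or "ml".
    """
    for rule in rules:
        label = rule(text)
        if label is not None:
            return label, "rule"
    return fallback(text), "ml"


def keyword_rule(keyword: str, label: str) -> Rule:
    """Build a rule that fires when the keyword appears as a whole token."""
    def rule(text: str) -> Optional[str]:
        return label if keyword in text.lower().split() else None
    return rule


def stub_model(text: str) -> str:
    """Placeholder for a trained classifier; always answers "neutral"."""
    return "neutral"


RULES = [keyword_rule("sedap", "positive"), keyword_rule("teruk", "negative")]

if __name__ == "__main__":
    print(route("Nasi lemak ni sedap", RULES, stub_model))
    print(route("Esok hujan", RULES, stub_model))
```

Returning the source alongside the label makes it easy to measure how often the cheaper rule path handles a request.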

Integrations

  • spacy - Drop-in spaCy pipeline components
  • rest_api - FastAPI server with Swagger docs
  • cli - Full CLI for every module, file processing, pipelines
  • langchain - LangChain tool wrappers for agent usage

Universal API Pattern

Every module follows the same call convention:

import malaysian_manglish_nlp as mnlp

# Single text
result = mnlp.<module>(text)

# With options
result = mnlp.<module>(text, lang="ms", detailed=True)

# Batch (list input → list output)
results = mnlp.<module>(["text1", "text2", "text3"])
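
One way a "single text or list" convention like this could be implemented is with a small decorator that dispatches on the input type. The sketch below is an assumption about the general technique, not the library's source; batchable and shout are hypothetical names used only for the demonstration.

```python
from functools import wraps
from typing import Any, Callable


def batchable(func: Callable[..., Any]) -> Callable[..., Any]:
    """Let a single-text function also accept a list (list in, list out)."""
    @wraps(func)
    def wrapper(text, **options):
        if isinstance(text, (list, tuple)):
            # Apply the same options to every item and preserve input order.
            return [func(item, **options) for item in text]
        return func(text, **options)
    return wrapper


@batchable
def shout(text: str, suffix: str = "!") -> str:
    """Stand-in for a real module: upper-case the text and append a suffix."""
    return text.upper() + suffix


if __name__ == "__main__":
    print(shout("boleh"))
    print(shout(["boleh", "tak boleh"], suffix="?"))
```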

Dependency Tiers

┌──────────────────────────────────────────────────────┐
│  Tier 0 - Core (zero external deps)                  │
│  normalize, clean, tokenize, stem, segment, spelling │
├──────────────────────────────────────────────────────┤
│  Tier 1 - Analysis (lightweight models)              │
│  sentiment, ner, pos, keywords, language, profanity  │
├──────────────────────────────────────────────────────┤
│  Tier 2 - Advanced (optional ML)                     │
│  code_switching, intent, topic, hate_speech, stance  │
├──────────────────────────────────────────────────────┤
│  Tier 3 - Generation (requires [ml])                 │
│  translate, summarize, generate, qa, embeddings      │
├──────────────────────────────────────────────────────┤
│  Tier 4 - Integrations (requires [spacy]/[langchain])│
│  spacy, rest_api, langchain                          │
└──────────────────────────────────────────────────────┘

Optional Dependencies

Core modules have zero external dependencies. Install extras only when needed:

pip install "malaysian-manglish-nlp[ml]"         # Tier 2-3
pip install "malaysian-manglish-nlp[spacy]"      # spaCy integration
pip install "malaysian-manglish-nlp[langchain]"  # LangChain tools
pip install "malaysian-manglish-nlp[all]"        # Everything
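
Before calling a module from a higher tier, it can be useful to check whether its optional dependency is present. The snippet below is a generic sketch using the standard library's importlib.util.find_spec. The helper names are hypothetical, and which import names each extra actually provides is an assumption to verify against the package metadata (spacy is used here only because that is spaCy's import name).

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


def require_extra(module_name: str, extra: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing."""
    if not is_available(module_name):
        raise ImportError(
            f"'{module_name}' is not installed; try: "
            f'pip install "malaysian-manglish-nlp[{extra}]"'
        )


if __name__ == "__main__":
    # Result depends on the local environment, so no expected output is shown.
    print("spaCy available:", is_available("spacy"))
```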