Text Normalization¶
Convert informal Manglish into clean, standard text through shortform expansion, noise removal, and formalization.
Why normalization?¶
Malaysian social media text is full of shortforms ("nk", "brp", "xpe"), slang, elongated words ("bestttt"), and noise (URLs, mentions, repeated characters). Most NLP tools choke on this. Normalization converts messy input into clean text that downstream modules can process accurately.
malaysian-manglish-nlp has 638+ shortform mappings and handles Malaysian-specific patterns natively.
Load module¶
import malaysian_manglish_nlp as mnlp
# Basic normalization
result = mnlp.normalize("xpe la bro, aku nk g mkn jap lg")
print(result)
# "takpe la bro, aku nak pergi makan sekejap lagi"
Basic usage¶
Shortform expansion¶
Shortforms are expanded through the built-in dictionary of 638+ mappings:
mnlp.normalize("nk tnya brp hrga")
# "nak tanya berapa harga"
mnlp.normalize("sy xfhm ape ko ckp")
# "saya tak faham apa kau cakap"
mnlp.normalize("jap lg aku smpai")
# "sekejap lagi aku sampai"
mnlp.normalize("xpyh la, dh xnk pg")
# "tak payah la, dah tak nak pergi"
mnlp.normalize("blh tlg sy x?")
# "boleh tolong saya tak?"
Common shortform examples¶
| Input | Output |
|---|---|
| nk | nak |
| brp / brapa | berapa |
| xpe / xpe la | takpe / takpe la |
| sy / sya | saya |
| ko / kau | kau |
| xfhm | tak faham |
| jap | sekejap |
| mkn | makan |
| pg / pergi | pergi |
| smpai | sampai |
| blh | boleh |
| tlg | tolong |
| dh / dah | dah (already standard) |
| dgn | dengan |
| utk | untuk |
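The lookup behind shortform expansion can be pictured as a token-by-token dictionary replacement. The following is a minimal standalone sketch under that assumption, not the library's implementation: `SHORTFORMS` holds only a few entries from the table, and `expand_shortforms` is a hypothetical helper name.

```python
import re

# A handful of mappings copied from the table; the real dictionary is far larger.
SHORTFORMS = {
    "nk": "nak",
    "brp": "berapa",
    "sy": "saya",
    "ko": "kau",
    "mkn": "makan",
    "blh": "boleh",
    "tlg": "tolong",
    "dgn": "dengan",
    "utk": "untuk",
}

def expand_shortforms(text: str) -> str:
    """Replace each known shortform token and leave everything else untouched."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Lookup is case-insensitive; the replacement is always lowercase.
        return SHORTFORMS.get(token.lower(), token)
    # \w+ matches word tokens, so punctuation and spacing pass through unchanged.
    return re.sub(r"\w+", replace, text)

print(expand_shortforms("sy nk mkn dgn ko"))  # saya nak makan dengan kau
```

A plain lookup like this cannot handle shortforms whose expansion depends on context, which is why the library layers further rules on top of its dictionary.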
Clean text¶
Remove noise from social media text:
mnlp.clean("Best gila!!! Check out https://example.com @user #food #sedap")
# "Best gila Check out"
mnlp.clean("Sedapppp gilaaaa 😍😍😍")
# "Sedap gila"
mnlp.clean("RT @user: Beli sekarang!!! LIMITED EDITION!!!")
# "Beli sekarang LIMITED EDITION"
Clean options¶
# Remove URLs only
mnlp.clean("Visit https://example.com for info", remove_urls=True)
# Remove mentions only
mnlp.clean("@user best gila", remove_mentions=True)
# Remove emojis only
mnlp.clean("Sedap 😍🔥", remove_emojis=True)
# Remove hashtags only
mnlp.clean("Best #food #kl", remove_hashtags=True)
# Clean for NLP (strips everything, normalizes whitespace)
mnlp.clean_for_nlp("@user Besttt gila!!! 😍 Check https://t.co/abc #food")
# "Best gila Check"
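Noise removal of this kind is typically a series of regular-expression substitutions. The sketch below is an illustration under that assumption rather than the library's code; `strip_noise` and the three patterns are hypothetical names, and the patterns are deliberately simple.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def strip_noise(text: str) -> str:
    """Drop URLs, @mentions, and #hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    # split() with no arguments also trims leading and trailing whitespace.
    return " ".join(text.split())

print(strip_noise("@user check https://t.co/abc #food now"))  # check now
```

URLs are removed first so that a fragment such as `#section` inside a link is not mistaken for a hashtag.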
Formalize¶
Convert casual Manglish to formal Bahasa Melayu:
mnlp.formalize("aku nk g mkn jap")
# "saya hendak pergi makan sebentar"
mnlp.formalize("ko dah makan ke belum?")
# "awak sudah makan atau belum?"
mnlp.formalize("xpe la, aku tunggu je")
# "tidak mengapalah, saya tunggu sahaja"
Spelling correction¶
Fix misspelled words with context awareness:
mnlp.correct("Saya suka makn nasi lemak")
# "Saya suka makan nasi lemak"
mnlp.correct("Dia perghi kedai mamak")
# "Dia pergi kedai mamak"
# Single word correction
mnlp.correct_word("makn")
# "makan"
Contextual spelling¶
For more accurate correction that considers the surrounding words, use correct_contextual().
When to use contextual spelling
Basic correct() is faster and works well for obvious typos. Use correct_contextual() when a word has several valid corrections and the context decides between them.
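To make the difference concrete, here is a toy sketch of context-sensitive correction. It is not how the library works internally: the vocabulary, the bigram counts, and the helper name `correct_in_context` are all invented for illustration. Candidates come from `difflib.get_close_matches`, and the following word breaks ties between them.

```python
from difflib import get_close_matches

VOCAB = ["makan", "main", "nasi", "bola", "suka", "saya"]
# Toy bigram counts (hypothetical numbers, for illustration only).
BIGRAMS = {("makan", "nasi"): 9, ("main", "bola"): 8}

def correct_in_context(words: list[str]) -> list[str]:
    """Fix out-of-vocabulary words, using the next word to choose a candidate."""
    fixed = []
    for i, word in enumerate(words):
        if word in VOCAB:
            fixed.append(word)
            continue
        # Candidates are returned most-similar first.
        candidates = get_close_matches(word, VOCAB, n=3, cutoff=0.6)
        if not candidates:
            fixed.append(word)  # nothing close enough; keep the word as typed
            continue
        nxt = words[i + 1] if i + 1 < len(words) else None
        # Prefer the candidate seen most often before the next word;
        # max() keeps the first (closest) candidate on ties.
        fixed.append(max(candidates, key=lambda c: BIGRAMS.get((c, nxt), 0)))
    return fixed

print(correct_in_context("saya suka makn nasi".split()))
# ['saya', 'suka', 'makan', 'nasi']
print(correct_in_context("saya suka makn bola".split()))
# ['saya', 'suka', 'main', 'bola']
```

The same typo, "makn", resolves to "makan" before "nasi" and to "main" before "bola", which is the kind of case where a context-free corrector has to guess.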
Advanced normalization¶
Elongated words¶
mnlp.normalize_elongated("Bestttt gilaaaa sangatttt")
# "Best gila sangat"
mnlp.normalize_elongated("Sedapppp nauzubillahhh")
# "Sedap nauzubillah"
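The "3+ repeats reduced" rule can be expressed as a single backreference regex. This is a minimal sketch of that rule only; `squeeze_elongated` is a hypothetical name, and the library may apply extra dictionary checks that are not shown here.

```python
import re

# Any character repeated three or more times in a row.
ELONGATED_RE = re.compile(r"(.)\1{2,}")

def squeeze_elongated(text: str) -> str:
    """Collapse runs of 3+ identical characters down to a single character."""
    return ELONGATED_RE.sub(r"\1", text)

print(squeeze_elongated("Bestttt gilaaaa sangatttt"))  # Best gila sangat
```

Because only runs of three or more are touched, legitimate double letters such as the "ll" in "nauzubillah" survive. The rule is still lossy: "maaaaf" becomes "maf" rather than "maaf", so a real implementation needs a dictionary check for such words.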
Money normalization¶
mnlp.normalize_money("harga 50 ringgit je")
# "harga RM50 je"
mnlp.normalize_money("bayar 1.5k sebulan")
# "bayar RM1500 sebulan"
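The two examples above can be reproduced with a pair of regular expressions. The sketch below is an assumption about the general approach, not the library's code; `to_rm` is a hypothetical name, and treating every "<number>k" as a ringgit amount is a simplification.

```python
import re

RINGGIT_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*ringgit\b", re.IGNORECASE)
THOUSANDS_RE = re.compile(r"\b(\d+(?:\.\d+)?)k\b", re.IGNORECASE)

def to_rm(text: str) -> str:
    """Rewrite '<n> ringgit' and '<n>k' amounts in RM notation."""
    # "1.5k" -> "RM1500"; round() guards against float error such as 2299.9999.
    text = THOUSANDS_RE.sub(lambda m: f"RM{round(float(m.group(1)) * 1000)}", text)
    # "50 ringgit" -> "RM50"
    return RINGGIT_RE.sub(lambda m: f"RM{m.group(1)}", text)

print(to_rm("harga 50 ringgit je"))  # harga RM50 je
print(to_rm("bayar 1.5k sebulan"))   # bayar RM1500 sebulan
```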
Phone number normalization¶
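As a rough illustration of what phone number normalization involves, the sketch below converts a Malaysian number written in local form to the international +60 form. Both the helper name `to_e164_my` and the choice of output format are assumptions; the library's own function and output may differ.

```python
import re

def to_e164_my(number: str) -> str:
    """Convert a Malaysian number in local or +60 form to E.164 style."""
    digits = re.sub(r"\D", "", number)  # keep digits only
    if digits.startswith("60"):
        return "+" + digits              # already carries the country code
    if digits.startswith("0"):
        return "+60" + digits[1:]        # drop the trunk prefix 0
    raise ValueError(f"unrecognised Malaysian number: {number!r}")

print(to_e164_my("011-57048145"))      # +601157048145
print(to_e164_my("+60 11-5704 8145"))  # +601157048145
```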
Date and time normalization¶
mnlp.normalize_date("jumpa 15/6/2026")
# "jumpa 2026-06-15"
mnlp.normalize_time("pukul 3 ptg")
# "15:00"
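Both conversions above are small parsing problems. The sketch below shows one way to do them with the standard library; `iso_date` and `to_24h` are hypothetical names, day-first date order is assumed, and only hours 1 to 11 are handled because 12 o'clock needs separate rules.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
# Period words treated as "add 12 hours": petang (afternoon) and malam (night).
LATE_PERIODS = {"ptg", "petang", "mlm", "malam"}

def iso_date(text: str) -> str:
    """Rewrite day/month/year dates as ISO 8601 (YYYY-MM-DD)."""
    def repl(m: re.Match) -> str:
        day, month, year = (int(g) for g in m.groups())
        # date() raises ValueError for impossible dates such as 31/2/2026.
        return date(year, month, day).isoformat()
    return DATE_RE.sub(repl, text)

def to_24h(hour: int, period: str) -> str:
    """Convert an hour plus a Malay period word to 24-hour HH:00."""
    if not 1 <= hour <= 11:
        raise ValueError("this sketch only handles hours 1 to 11")
    if period.lower() in LATE_PERIODS:
        hour += 12
    return f"{hour:02d}:00"

print(iso_date("jumpa 15/6/2026"))  # jumpa 2026-06-15
print(to_24h(3, "ptg"))             # 15:00
```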
Normalize all¶
Apply all normalizations at once:
mnlp.normalize_all("call sy 011-57048145, jmpa 15/6 jam 3ptg, byr RM50")
# All patterns normalized in one pass
Before/after examples¶
| Before | After |
|---|---|
| xpe la bro, aku nk g mkn jap lg | takpe la bro, aku nak pergi makan sekejap lagi |
| Bestttt gilaaaa 😍😍 | Best gila |
| sy xfhm la ape ko ckp ni | saya tak faham la apa kau cakap ni |
| brp hrga nasi lemak tu? | berapa harga nasi lemak tu? |
| @user check https://t.co/abc #food | check #food (with clean) |
Chain normalizers¶
Combine normalization steps for best results:
text = "@user Bestttt gilaaaa!!! 😍😍 nk order 2, brp hrga?"
# Step 1: Clean noise
clean = mnlp.clean(text)
# "Bestttt gilaaaa nk order 2, brp hrga?"
# Step 2: Fix elongated words
fixed = mnlp.normalize_elongated(clean)
# "Best gila nk order 2, brp hrga?"
# Step 3: Expand shortforms
final = mnlp.normalize(fixed)
# "Best gila nak order 2, berapa harga?"
Or use normalize_all (see Normalize all above) for one-pass processing.
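When the same sequence of steps is applied to many texts, it can help to wrap the chain in a single callable. The helper below is a generic sketch; `chain` is a hypothetical name and is not part of the library. The stand-in steps are built-in string methods so the example runs without the package installed.

```python
from functools import reduce
from typing import Callable

def chain(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that applies each step in order, left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

# Stand-in steps; with the library installed these would be
# mnlp.clean, mnlp.normalize_elongated, and mnlp.normalize.
pipeline = chain(str.strip, str.lower)
print(pipeline("  Best Gila  "))  # best gila
```

Order matters: cleaning should run before elongation handling, since a rule that collapses three or more repeated characters would also turn the "www" in a URL into "w".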
CLI usage¶
# Basic normalize
$ mnlp normalize "nk tnya brp hrga"
nak tanya berapa harga
# Clean text
$ mnlp clean "Best gila!!! 😍😍 #food"
Best gila
# Formalize
$ mnlp formalize "aku nk g mkn jap"
saya hendak pergi makan sebentar
# Pipe chain
$ echo "xpe la best gila" | mnlp normalize | mnlp sentiment
"takpe la best gila" → positive (0.89)
# Spelling correction
$ mnlp correct "Saya suka makn nasi"
Saya suka makan nasi
How it works¶
- Dictionary lookup - 638+ shortform mappings (nk → nak, brp → berapa)
- Pattern matching - elongated chars (3+ repeats reduced), URLs, mentions
- Context rules - some shortforms depend on position/context
- Preserve particles - "la", "lah", "kan", "weh" stay intact
- Case handling - normalize_preserve_case keeps original casing
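Case handling in the list above can be pictured as copying the casing pattern of the original token onto its expansion. This is a sketch of that idea only; `match_case` is a hypothetical helper, and how normalize_preserve_case actually treats mixed-case input is not specified here.

```python
def match_case(source: str, replacement: str) -> str:
    """Give `replacement` the casing pattern of `source`."""
    if source.isupper():
        return replacement.upper()        # "BRP" -> "BERAPA"
    if source[:1].isupper():
        return replacement.capitalize()   # "Nk" -> "Nak"
    return replacement                    # "nk" -> "nak"

print(match_case("Nk", "nak"))      # Nak
print(match_case("BRP", "berapa"))  # BERAPA
```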
Performance¶
| Metric | Score |
|---|---|
| Shortform accuracy | 96.8% |
| Cleaning accuracy | 98.2% |
| Formalization accuracy | 87.5% |
| Throughput | 45,000 texts/sec |
| Latency (single) | < 0.2ms |
See also¶
- Sentiment Analysis - normalize before sentiment for better accuracy
- Translation - normalization is applied automatically during translation
- Language Detection - clean text improves detection
- Pipeline - chain normalization with other modules
- API Reference - full function signatures