Text Normalization¶

Convert informal Manglish into clean, standard text through shortform expansion, noise removal, and formalization.

Why normalization?¶

Malaysian social media text is full of shortforms ("nk", "brp", "xpe"), slang, elongated words ("bestttt"), and noise (URLs, mentions, repeated characters). Most NLP tools choke on this. Normalization converts messy input into clean text that downstream modules can process accurately.

malaysian-manglish-nlp has 638+ shortform mappings and handles Malaysian-specific patterns natively.

Load module¶

import malaysian_manglish_nlp as mnlp

# Basic normalization
result = mnlp.normalize("xpe la bro, aku nk g mkn jap lg")
print(result)
# "takpe la bro, aku nak pergi makan sekejap lagi"

Basic usage¶

Shortform expansion¶

malaysian-manglish-nlp knows 638+ Malaysian text shortforms:

mnlp.normalize("nk tnya brp hrga")
# "nak tanya berapa harga"

mnlp.normalize("sy xfhm ape ko ckp")
# "saya tak faham apa kau cakap"

mnlp.normalize("jap lg aku smpai")
# "sekejap lagi aku sampai"

mnlp.normalize("xpyh la, dh xnk pg")
# "tak payah la, dah tak nak pergi"

mnlp.normalize("blh tlg sy x?")
# "boleh tolong saya tak?"

Common shortform examples¶

Input	Output
`nk`	`nak`
`brp` / `brapa`	`berapa`
`xpe` / `xpe la`	`takpe` / `takpe la`
`sy` / `sya`	`saya`
`ko` / `kau`	`kau`
`xfhm`	`tak faham`
`jap`	`sekejap`
`mkn`	`makan`
`pg` / `pergi`	`pergi`
`smpai`	`sampai`
`blh`	`boleh`
`tlg`	`tolong`
`dh` / `dah`	`dah` (colloquial form is kept; `formalize` gives `sudah`)
`dgn`	`dengan`
`utk`	`untuk`
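
The table above is essentially a lookup table, so the core idea can be sketched in a few lines. This is a minimal illustration and not the library's implementation: `SHORTFORMS` is a hand-picked subset of the mappings listed above, and `expand_shortforms` is a hypothetical helper name.

```python
import re

# Illustrative subset of the table above (the real list has 638+ entries).
SHORTFORMS = {
    "nk": "nak",
    "brp": "berapa",
    "sy": "saya",
    "xfhm": "tak faham",
    "mkn": "makan",
    "blh": "boleh",
    "tlg": "tolong",
    "dgn": "dengan",
    "utk": "untuk",
}


def expand_shortforms(text):
    """Replace each known shortform token and leave everything else untouched."""

    def swap(match):
        word = match.group(0)
        return SHORTFORMS.get(word.lower(), word)

    # \w+ only matches word tokens, so punctuation and spacing are preserved.
    return re.sub(r"\w+", swap, text)


print(expand_shortforms("sy nk mkn dgn kawan, blh tlg?"))
# saya nak makan dengan kawan, boleh tolong?
```

A plain dictionary lookup like this cannot handle shortforms whose meaning depends on context, which is why a real normalizer needs extra rules on top.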

Clean text¶

Remove noise from social media text:

mnlp.clean("Best gila!!! Check out https://example.com @user #food #sedap")
# "Best gila Check out"

mnlp.clean("Sedapppp gilaaaa 😍😍😍")
# "Sedap gila"

mnlp.clean("RT @user: Beli sekarang!!! LIMITED EDITION!!!")
# "Beli sekarang LIMITED EDITION"

Clean options¶

# Remove URLs only
mnlp.clean("Visit https://example.com for info", remove_urls=True)

# Remove mentions only
mnlp.clean("@user best gila", remove_mentions=True)

# Remove emojis only
mnlp.clean("Sedap 😍🔥", remove_emojis=True)

# Remove hashtags only
mnlp.clean("Best #food #kl", remove_hashtags=True)

# Clean for NLP (strips everything, normalizes whitespace)
mnlp.clean_for_nlp("@user Besttt gila!!! 😍 Check https://t.co/abc #food")
# "Best gila Check"
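
To give a feel for what this kind of noise removal involves, here is a rough regex-based sketch. It is an assumption about the general technique, not the library's code: the pattern names and `strip_noise` are hypothetical, and the emoji range covers only the main pictograph blocks.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji coverage: main pictograph and symbol blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_noise(text):
    """Remove URLs, mentions, hashtags, emojis and repeated punctuation."""
    # URLs go first because a URL may itself contain '#' or '@'.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    # Runs of two or more !, ? or . are treated as noise.
    text = re.sub(r"[!?.]{2,}", " ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(strip_noise("@user best gila!!! 😍 https://t.co/abc #food"))
# best gila
```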

Formalize¶

Convert casual Manglish to formal Bahasa Melayu:

mnlp.formalize("aku nk g mkn jap")
# "saya hendak pergi makan sebentar"

mnlp.formalize("ko dah makan ke belum?")
# "awak sudah makan atau belum?"

mnlp.formalize("xpe la, aku tunggu je")
# "tidak mengapalah, saya tunggu sahaja"

Spelling correction¶

Fix misspelled words:

mnlp.correct("Saya suka makn nasi lemak")
# "Saya suka makan nasi lemak"

mnlp.correct("Dia perghi kedai mamak")
# "Dia pergi kedai mamak"

# Single word correction
mnlp.correct_word("makn")
# "makan"
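
For intuition, context-free correction of this kind can be approximated with closest-match lookup against a word list. The sketch below uses Python's standard `difflib` and is only an illustration: `VOCAB`, `closest_word`, and `fix_typos` are hypothetical, and the library may use a different method entirely.

```python
import difflib

# Hypothetical mini-vocabulary; a real corrector would use a full lexicon.
VOCAB = ["saya", "suka", "makan", "nasi", "lemak", "dia", "pergi", "kedai", "mamak"]


def closest_word(word, cutoff=0.8):
    """Return the most similar vocabulary word, or the input if none is close."""
    if word.lower() in VOCAB:
        return word  # already a known word, keep original casing
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def fix_typos(text):
    """Correct each whitespace-separated token independently (no context)."""
    return " ".join(closest_word(token) for token in text.split())


print(fix_typos("Saya suka makn nasi lemak"))  # Saya suka makan nasi lemak
print(fix_typos("Dia perghi kedai mamak"))     # Dia pergi kedai mamak
```

Because each token is corrected in isolation, this approach cannot choose between two equally close candidates, which is the gap contextual correction is meant to fill.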

Contextual spelling¶

For more accurate correction that considers surrounding words:

mnlp.correct_contextual("Saya nak pergy kedai makn")
# "Saya nak pergi kedai makan"

When to use contextual spelling

Basic correct() is faster and good for obvious typos. Use correct_contextual() when words have multiple valid corrections and context matters.

Advanced normalization¶

Elongated words¶

mnlp.normalize_elongated("Bestttt gilaaaa sangatttt")
# "Best gila sangat"

mnlp.normalize_elongated("Sedapppp nauzubillahhh")
# "Sedap nauzubillah"
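
The underlying pattern is simple to sketch with a backreference. This is an illustration of the technique and not the library's code: `collapse_elongation` is a hypothetical name, and reducing any run of three or more identical letters to one is an assumption based on the examples above.

```python
import re


def collapse_elongation(text):
    """Reduce any letter repeated 3+ times in a row to a single letter."""
    # [^\W\d_] matches letters only, so numbers such as 1000 are left alone.
    return re.sub(r"([^\W\d_])\1{2,}", r"\1", text)


print(collapse_elongation("Bestttt gilaaaa sangatttt"))  # Best gila sangat
print(collapse_elongation("Sedapppp nauzubillahhh"))     # Sedap nauzubillah
```

Legitimate double letters (the "ll" in "nauzubillah") survive because only runs of three or more are collapsed.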

Money normalization¶

mnlp.normalize_money("harga 50 ringgit je")
# "harga RM50 je"

mnlp.normalize_money("bayar 1.5k sebulan")
# "bayar RM1500 sebulan"
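
The two cases above can be reproduced with a pair of regular expressions. This is a hedged sketch rather than the library's logic: `normalize_money_sketch` is a hypothetical name, and treating every "<number>k" as a ringgit amount is a simplifying assumption.

```python
import re
from decimal import Decimal


def normalize_money_sketch(text):
    """Rewrite '<n> ringgit' and '<n>k' amounts as 'RM<n>'."""

    def k_to_rm(match):
        # Decimal avoids float rounding surprises when scaling by 1000.
        amount = Decimal(match.group(1)) * 1000
        return f"RM{int(amount)}"

    # "1.5k" -> "RM1500"
    text = re.sub(r"\b(\d+(?:\.\d+)?)k\b", k_to_rm, text, flags=re.IGNORECASE)
    # "50 ringgit" -> "RM50"
    return re.sub(r"\b(\d+(?:\.\d+)?)\s*ringgit\b", r"RM\1", text, flags=re.IGNORECASE)


print(normalize_money_sketch("harga 50 ringgit je"))  # harga RM50 je
print(normalize_money_sketch("bayar 1.5k sebulan"))   # bayar RM1500 sebulan
```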

Phone number normalization¶

mnlp.normalize_phone("call aku 011-57048145")
# "call aku +601157048145"
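
The conversion shown above amounts to dropping the leading 0 and prefixing Malaysia's country code. A minimal sketch, assuming local numbers written as 0XX-XXXXXXX or 0XX-XXXXXXXX; `PHONE_RE` and `to_international` are hypothetical names, not library API.

```python
import re

# Leading 0, 1-2 more prefix digits, optional separator, 7-8 subscriber digits.
PHONE_RE = re.compile(r"\b0\d{1,2}[- ]?\d{7,8}\b")


def to_international(text):
    """Rewrite local Malaysian numbers in +60 international form."""

    def convert(match):
        digits = re.sub(r"\D", "", match.group(0))  # keep digits only
        return "+60" + digits[1:]                   # drop the trunk prefix 0

    return PHONE_RE.sub(convert, text)


print(to_international("call aku 011-57048145"))  # call aku +601157048145
```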

Date and time normalization¶

mnlp.normalize_date("jumpa 15/6/2026")
# "jumpa 2026-06-15"

mnlp.normalize_time("pukul 3 ptg")
# "pukul 15:00"
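
Both conversions can be sketched with the standard library. The code below is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, dates are assumed to be day/month/year, only the markers "pagi" (morning) and "ptg"/"petang" (afternoon) are handled, and matches are rewritten in place within the sentence.

```python
import re
from datetime import date


def normalize_date_sketch(text):
    """Rewrite D/M/YYYY dates as ISO 8601 (YYYY-MM-DD)."""

    def to_iso(match):
        day, month, year = (int(part) for part in match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            return match.group(0)  # not a real calendar date, leave as-is

    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", to_iso, text)


def normalize_time_sketch(text):
    """Rewrite '<hour> pagi' / '<hour> ptg' as 24-hour HH:00."""

    def to_24h(match):
        hour, marker = int(match.group(1)), match.group(2).lower()
        if not 1 <= hour <= 12:
            return match.group(0)
        if marker == "pagi":
            hour = hour % 12          # 12 pagi -> 00:00
        elif hour < 12:
            hour += 12                # 3 ptg -> 15:00
        return f"{hour:02d}:00"

    return re.sub(r"\b(\d{1,2})\s*(pagi|ptg|petang)\b", to_24h, text, flags=re.IGNORECASE)


print(normalize_date_sketch("jumpa 15/6/2026"))  # jumpa 2026-06-15
print(normalize_time_sketch("pukul 3 ptg"))      # pukul 15:00
```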

Normalize all¶

Apply all normalizations at once:

mnlp.normalize_all("call sy 011-57048145, jmpa 15/6 jam 3ptg, byr RM50")
# All patterns normalized in one pass

Before/after examples¶

Before	After
`xpe la bro, aku nk g mkn jap lg`	`takpe la bro, aku nak pergi makan sekejap lagi`
`Bestttt gilaaaa 😍😍`	`Best gila`
`sy xfhm la ape ko ckp ni`	`saya tak faham la apa kau cakap ni`
`brp hrga nasi lemak tu?`	`berapa harga nasi lemak tu?`
`@user check https://t.co/abc #food`	`check` (with clean)

Chain normalizers¶

Combine normalization steps for best results:

text = "@user Bestttt gilaaaa!!! 😍😍 nk order 2, brp hrga?"

# Step 1: Clean noise
clean = mnlp.clean(text)
# "Bestttt gilaaaa nk order 2, brp hrga?"

# Step 2: Fix elongated words
fixed = mnlp.normalize_elongated(clean)
# "Best gila nk order 2, brp hrga?"

# Step 3: Expand shortforms
final = mnlp.normalize(fixed)
# "Best gila nak order 2, berapa harga?"

Or use normalize_all for one-pass processing:

mnlp.normalize_all("@user Bestttt gilaaaa!!! nk order, brp hrga?")
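
If the same chain is applied in many places, it can help to wrap the steps in a single callable. The sketch below shows one way to do that with `functools.reduce`. The `compose` helper and the three stand-in steps are hypothetical and exist only to keep the example self-contained; they are not the library's functions.

```python
import re
from functools import reduce


def compose(*steps):
    """Return a function that applies each step to the text, left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


# Stand-in steps for illustration only.
def drop_mentions(text):
    return re.sub(r"@\w+\s*", "", text)


def collapse_repeats(text):
    return re.sub(r"([^\W\d_])\1{2,}", r"\1", text)


def expand_nk(text):
    return re.sub(r"\bnk\b", "nak", text)


pipeline = compose(drop_mentions, collapse_repeats, expand_nk)
print(pipeline("@user Bestttt nk order"))  # Best nak order
```

With the library installed, the same helper should work with `mnlp.clean`, `mnlp.normalize_elongated`, and `mnlp.normalize` passed in that order, since each takes a string and returns a string in the examples above.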

CLI usage¶

# Basic normalize
$ mnlp normalize "nk tnya brp hrga"
nak tanya berapa harga

# Clean text
$ mnlp clean "Best gila!!! 😍😍 #food"
Best gila

# Formalize
$ mnlp formalize "aku nk g mkn jap"
saya hendak pergi makan sebentar

# Pipe chain
$ echo "xpe la best gila" | mnlp normalize | mnlp sentiment
"takpe la best gila" → positive (0.89)

# Spelling correction
$ mnlp correct "Saya suka makn nasi"
Saya suka makan nasi

How it works¶

Dictionary lookup - 638+ shortform mappings (nk→nak, brp→berapa)
Pattern matching - elongated chars (3+ repeats reduced), URLs, mentions
Context rules - some shortforms depend on position/context
Preserve particles - "la", "lah", "kan", "weh" stay intact
Case handling - normalize_preserve_case keeps original casing
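
The particle and case rules above can be illustrated at token level. This is a simplified sketch, not the library's internals: `PARTICLES` holds only the four particles named above, `SHORTFORMS` is a two-entry sample, and `match_case` and `normalize_token` are hypothetical helpers.

```python
PARTICLES = {"la", "lah", "kan", "weh"}
SHORTFORMS = {"nk": "nak", "brp": "berapa"}


def match_case(original, replacement):
    """Give the replacement the same casing style as the original token."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def normalize_token(token):
    key = token.lower()
    if key in PARTICLES:
        return token  # discourse particles stay intact
    if key in SHORTFORMS:
        return match_case(token, SHORTFORMS[key])
    return token


print(" ".join(normalize_token(t) for t in "Nk tanya BRP la".split()))
# Nak tanya BERAPA la
```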

Performance¶

Metric	Score
Shortform accuracy	96.8%
Cleaning accuracy	98.2%
Formalization accuracy	87.5%
Throughput	45,000 texts/sec
Latency (single)	< 0.2 ms