
Text Normalization

Convert informal Manglish into clean, standard text through shortform expansion, noise removal, and formalization.


Why normalization?

Malaysian social media text is full of shortforms ("nk", "brp", "xpe"), slang, elongated words ("bestttt"), and noise (URLs, mentions, repeated characters). Most NLP tools choke on this. Normalization converts messy input into clean text that downstream modules can process accurately.

malaysian-manglish-nlp has 638+ shortform mappings and handles Malaysian-specific patterns natively.


Load module

import malaysian_manglish_nlp as mnlp

# Basic normalization
result = mnlp.normalize("xpe la bro, aku nk g mkn jap lg")
print(result)
# "takpe la bro, aku nak pergi makan sekejap lagi"

Basic usage

Shortform expansion

malaysian-manglish-nlp knows 638+ Malaysian text shortforms:

mnlp.normalize("nk tnya brp hrga")
# "nak tanya berapa harga"

mnlp.normalize("sy xfhm ape ko ckp")
# "saya tak faham apa kau cakap"

mnlp.normalize("jap lg aku smpai")
# "sekejap lagi aku sampai"

mnlp.normalize("xpyh la, dh xnk pg")
# "tak payah la, dah tak nak pergi"

mnlp.normalize("blh tlg sy x?")
# "boleh tolong saya tak?"

Common shortform examples

| Input | Output |
| --- | --- |
| nk | nak |
| brp / brapa | berapa |
| xpe / xpe la | takpe / takpe la |
| sy / sya | saya |
| ko / kau | kau |
| xfhm | tak faham |
| jap | sekejap |
| mkn | makan |
| pg / pergi | pergi |
| smpai | sampai |
| blh | boleh |
| tlg | tolong |
| dh / dah | dah (already standard) |
| dgn | dengan |
| utk | untuk |
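
To illustrate the idea behind a lookup table like the one above, here is a minimal sketch of dictionary-based expansion. It is not the library's implementation: `SHORTFORMS` and `expand_shortforms` are hypothetical names, and the table holds only a handful of the entries listed above.

```python
import re

# Hypothetical mini-dictionary for illustration; the real table is far larger.
SHORTFORMS = {
    "nk": "nak",
    "brp": "berapa",
    "hrga": "harga",
    "tnya": "tanya",
    "sy": "saya",
    "mkn": "makan",
    "blh": "boleh",
    "tlg": "tolong",
    "dgn": "dengan",
    "utk": "untuk",
}


def expand_shortforms(text: str) -> str:
    """Replace each known shortform token with its expansion."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Unknown words fall through unchanged.
        return SHORTFORMS.get(word.lower(), word)

    # \w+ matches whole words, so punctuation and spacing are left untouched.
    return re.sub(r"\w+", replace, text)


print(expand_shortforms("nk tnya brp hrga"))  # nak tanya berapa harga
```

Matching whole tokens rather than substrings keeps a shortform such as "sy" from being rewritten inside a longer word.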

Clean text

Remove noise from social media text:

mnlp.clean("Best gila!!! Check out https://example.com @user #food #sedap")
# "Best gila"

mnlp.clean("Sedapppp gilaaaa 😍😍😍")
# "Sedap gila"

mnlp.clean("RT @user: Beli sekarang!!! LIMITED EDITION!!!")
# "Beli sekarang LIMITED EDITION"

Clean options

# Remove URLs only
mnlp.clean("Visit https://example.com for info", remove_urls=True)

# Remove mentions only
mnlp.clean("@user best gila", remove_mentions=True)

# Remove emojis only
mnlp.clean("Sedap 😍🔥", remove_emojis=True)

# Remove hashtags only
mnlp.clean("Best #food #kl", remove_hashtags=True)

# Clean for NLP (strips everything, normalizes whitespace)
mnlp.clean_for_nlp("@user Besttt gila!!! 😍 Check https://t.co/abc #food")
# "Best gila"
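
For readers curious how selective noise removal can work, the sketch below strips URLs, mentions, and hashtags with regular expressions. It is an illustration only, not how `mnlp.clean` is implemented; `strip_noise` and its `urls`, `mentions`, and `hashtags` flags are hypothetical names, and emoji and punctuation handling are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_noise(text: str, urls: bool = True, mentions: bool = True,
                hashtags: bool = True) -> str:
    """Remove the selected noise types, then collapse leftover whitespace."""
    if urls:
        text = URL_RE.sub(" ", text)
    if mentions:
        text = MENTION_RE.sub(" ", text)
    if hashtags:
        text = HASHTAG_RE.sub(" ", text)
    # split()/join() trims the ends and squeezes runs of spaces into one.
    return " ".join(text.split())


print(strip_noise("@user best gila https://t.co/abc #food"))  # best gila
print(strip_noise("Visit https://example.com for info",
                  mentions=False, hashtags=False))  # Visit for info
```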

Formalize

Convert casual Manglish to formal Bahasa Melayu:

mnlp.formalize("aku nk g mkn jap")
# "saya hendak pergi makan sebentar"

mnlp.formalize("ko dah makan ke belum?")
# "awak sudah makan atau belum?"

mnlp.formalize("xpe la, aku tunggu je")
# "tidak mengapalah, saya tunggu sahaja"

Spelling correction

Fix misspelled words:

mnlp.correct("Saya suka makn nasi lemak")
# "Saya suka makan nasi lemak"

mnlp.correct("Dia perghi kedai mamak")
# "Dia pergi kedai mamak"

# Single word correction
mnlp.correct_word("makn")
# "makan"
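
One common way to build a basic, non-contextual corrector is similarity matching against a vocabulary. The sketch below uses Python's standard `difflib` for that. It is an assumption-laden illustration, not the library's algorithm: `VOCAB`, `fix_word`, `fix_sentence`, and the 0.75 cutoff are all hypothetical.

```python
import difflib

# Hypothetical tiny vocabulary; a real corrector would load a full word list.
VOCAB = ["saya", "suka", "makan", "nasi", "lemak",
         "dia", "pergi", "kedai", "mamak", "nak"]


def fix_word(word: str, cutoff: float = 0.75) -> str:
    """Return the closest vocabulary word, or the input if nothing is close."""
    lowered = word.lower()
    if lowered in VOCAB:
        return word  # already a known word; keep original casing
    matches = difflib.get_close_matches(lowered, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def fix_sentence(text: str) -> str:
    """Correct each whitespace-separated token independently."""
    return " ".join(fix_word(token) for token in text.split())


print(fix_sentence("Saya suka makn nasi lemak"))  # Saya suka makan nasi lemak
```

Because each token is corrected in isolation, this approach cannot choose between two equally close candidates, which is the gap contextual correction is meant to fill.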

Contextual spelling

For more accurate correction that considers surrounding words:

mnlp.correct_contextual("Saya nak pergy kedai makn")
# "Saya nak pergi kedai makan"

When to use contextual spelling

Basic correct() is faster and handles obvious typos well. Use correct_contextual() when a misspelling has several plausible corrections and the surrounding words are needed to pick the right one.


Advanced normalization

Elongated words

mnlp.normalize_elongated("Bestttt gilaaaa sangatttt")
# "Best gila sangat"

mnlp.normalize_elongated("Sedapppp nauzubillahhh")
# "Sedap nauzubillah"
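
The behaviour shown above can be approximated with a single backreference pattern. This is a minimal sketch, assuming any letter repeated three or more times should collapse to one; `squeeze_elongated` is a hypothetical name, not a library function.

```python
import re

# A letter (no digits or underscores) followed by 2+ copies of itself.
ELONGATED_RE = re.compile(r"([^\W\d_])\1{2,}")


def squeeze_elongated(text: str) -> str:
    """Collapse runs of 3+ identical letters down to a single letter."""
    return ELONGATED_RE.sub(r"\1", text)


print(squeeze_elongated("Bestttt gilaaaa sangatttt"))  # Best gila sangat
print(squeeze_elongated("harga 1000"))                 # harga 1000
```

Restricting the pattern to letters keeps numbers such as 1000 intact. A word whose standard spelling ends in a doubled letter would be over-reduced by this rule, so a real implementation likely needs a dictionary check on top.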

Money normalization

mnlp.normalize_money("harga 50 ringgit je")
# "harga RM50 je"

mnlp.normalize_money("bayar 1.5k sebulan")
# "bayar RM1500 sebulan"
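
As a rough illustration of the two cases above, the sketch below rewrites "N ringgit" and "Nk" amounts into an RM prefix. It assumes a bare "k" suffix always means thousands of ringgit, which matches the example but may not hold in every text; `to_rm` is a hypothetical helper, not the library's code.

```python
import re
from decimal import Decimal

# A number followed by "k" (optionally "k ringgit"), or by "ringgit" alone.
AMOUNT_RE = re.compile(
    r"\b(\d+(?:\.\d+)?)(?:(k)(?:\s*ringgit)?|\s*ringgit)\b",
    re.IGNORECASE,
)


def to_rm(text: str) -> str:
    """Rewrite informal ringgit amounts as RM<amount>."""

    def replace(match: re.Match) -> str:
        amount = Decimal(match.group(1))
        if match.group(2):  # "k" suffix: multiply by one thousand
            amount *= 1000
        if amount == amount.to_integral_value():
            return f"RM{int(amount)}"
        return f"RM{amount:.2f}"

    return AMOUNT_RE.sub(replace, text)


print(to_rm("harga 50 ringgit je"))  # harga RM50 je
print(to_rm("bayar 1.5k sebulan"))   # bayar RM1500 sebulan
```

`Decimal` is used instead of `float` so that amounts like 2.3k scale exactly.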

Phone number normalization

mnlp.normalize_phone("call aku 011-57048145")
# "call aku +601157048145"
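
The conversion above amounts to dropping the leading 0 and prefixing Malaysia's country code, +60. A minimal sketch, assuming only mobile numbers of the form 01X followed by 7 or 8 digits; `to_e164` is a hypothetical name and landlines are not handled.

```python
import re

# 0, a two-digit mobile prefix (1X), an optional separator, then 7-8 digits.
PHONE_RE = re.compile(r"(?<!\d)0(1\d)[- ]?(\d{7,8})(?!\d)")


def to_e164(text: str) -> str:
    """Rewrite Malaysian mobile numbers in +60 international form."""
    return PHONE_RE.sub(r"+60\g<1>\g<2>", text)


print(to_e164("call aku 011-57048145"))  # call aku +601157048145
```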

Date and time normalization

mnlp.normalize_date("jumpa 15/6/2026")
# "jumpa 2026-06-15"

mnlp.normalize_time("pukul 3 ptg")
# "15:00"
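
The sketch below shows one way such rewrites could be done with regular expressions and the standard `datetime` module. It makes several assumptions: dates are day/month/year, only hours 1 to 11 are converted, and "ptg", "petang", "mlm", and "malam" all mean p.m. Unlike the example above, it keeps the surrounding words. `iso_dates` and `clock_times` are hypothetical names.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
TIME_RE = re.compile(r"\b(\d{1,2})\s*(pagi|ptg|petang|mlm|malam)\b",
                     re.IGNORECASE)


def iso_dates(text: str) -> str:
    """Rewrite D/M/YYYY dates as YYYY-MM-DD, leaving invalid dates alone."""

    def replace(match: re.Match) -> str:
        day, month, year = map(int, match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            return match.group(0)

    return DATE_RE.sub(replace, text)


def clock_times(text: str) -> str:
    """Rewrite '<hour> pagi/ptg/mlm' as a 24-hour HH:00 time."""

    def replace(match: re.Match) -> str:
        hour = int(match.group(1))
        if not 1 <= hour <= 11:
            return match.group(0)  # 12 o'clock and invalid hours: skip
        if match.group(2).lower() != "pagi":
            hour += 12  # afternoon, evening, or night
        return f"{hour:02d}:00"

    return TIME_RE.sub(replace, text)


print(iso_dates("jumpa 15/6/2026"))  # jumpa 2026-06-15
print(clock_times("pukul 3 ptg"))    # pukul 15:00
```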

Normalize all

Apply all normalizations at once:

mnlp.normalize_all("call sy 011-57048145, jmpa 15/6 jam 3ptg, byr RM50")
# All patterns normalized in one pass

Before/after examples

| Before | After |
| --- | --- |
| xpe la bro, aku nk g mkn jap lg | takpe la bro, aku nak pergi makan sekejap lagi |
| Bestttt gilaaaa 😍😍 | Best gila |
| sy xfhm la ape ko ckp ni | saya tak faham la apa kau cakap ni |
| brp hrga nasi lemak tu? | berapa harga nasi lemak tu? |
| @user check https://t.co/abc #food | check #food (with clean) |

Chain normalizers

Combine normalization steps for best results:

text = "@user Bestttt gilaaaa!!! 😍😍 nk order 2, brp hrga?"

# Step 1: Clean noise
clean = mnlp.clean(text)
# "Bestttt gilaaaa nk order 2, brp hrga?"

# Step 2: Fix elongated words
fixed = mnlp.normalize_elongated(clean)
# "Best gila nk order 2, brp hrga?"

# Step 3: Expand shortforms
final = mnlp.normalize(fixed)
# "Best gila nak order 2, berapa harga?"

Or use normalize_all for one-pass processing:

mnlp.normalize_all("@user Bestttt gilaaaa!!! nk order, brp hrga?")
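
If the same chain is applied in many places, it can help to wrap the steps in a reusable function. The sketch below is a generic composition helper, not part of the library: `make_pipeline` is a hypothetical name, and the two stand-in steps exist only so the example runs on its own. In practice the steps would be functions such as `mnlp.clean`, `mnlp.normalize_elongated`, and `mnlp.normalize`.

```python
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


def make_pipeline(*steps: Step) -> Step:
    """Return a function that applies each step in order, left to right."""

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


# Stand-in steps so the sketch runs without the library installed.
def strip_edges(text: str) -> str:
    return text.strip()


def squeeze_spaces(text: str) -> str:
    return " ".join(text.split())


pipeline = make_pipeline(strip_edges, squeeze_spaces)
print(pipeline("  nk   order 2  "))  # nk order 2
```

Order matters: cleaning and elongation fixes should run before shortform expansion so that the dictionary lookup sees canonical tokens.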

CLI usage

# Basic normalize
$ mnlp normalize "nk tnya brp hrga"
nak tanya berapa harga

# Clean text
$ mnlp clean "Best gila!!! 😍😍 #food"
Best gila

# Formalize
$ mnlp formalize "aku nk g mkn jap"
saya hendak pergi makan sebentar

# Pipe chain
$ echo "xpe la best gila" | mnlp normalize | mnlp sentiment
"takpe la best gila" → positive (0.89)

# Spelling correction
$ mnlp correct "Saya suka makn nasi"
Saya suka makan nasi

How it works

  1. Dictionary lookup - 638+ shortform mappings (nk→nak, brp→berapa)
  2. Pattern matching - elongated chars (3+ repeats reduced), URLs, mentions
  3. Context rules - some shortforms depend on position/context
  4. Preserve particles - "la", "lah", "kan", "weh" stay intact
  5. Case handling - normalize_preserve_case keeps original casing
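
Steps 1, 4, and 5 above can be sketched together: look each token up in a dictionary, skip protected particles, and re-apply the original casing to the replacement. This is an illustrative guess at the general approach, not the library's code; the table, the particle set, and both function names are hypothetical.

```python
import re

SHORTFORMS = {"nk": "nak", "brp": "berapa"}  # hypothetical entries
PARTICLES = {"la", "lah", "kan", "weh"}      # never rewritten


def match_case(original: str, replacement: str) -> str:
    """Give the replacement the same casing style as the original token."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def expand_preserving_case(text: str) -> str:
    """Expand shortforms while keeping particles and casing intact."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in PARTICLES or key not in SHORTFORMS:
            return word
        return match_case(word, SHORTFORMS[key])

    return re.sub(r"\w+", replace, text)


print(expand_preserving_case("Nk pergi la, BRP harga?"))
# Nak pergi la, BERAPA harga?
```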

Performance

| Metric | Score |
| --- | --- |
| Shortform accuracy | 96.8% |
| Cleaning accuracy | 98.2% |
| Formalization accuracy | 87.5% |
| Throughput | 45,000 texts/sec |
| Latency (single) | < 0.2ms |
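
Throughput depends on hardware and input length, so it is worth measuring on your own data. The helper below is a generic timing sketch using only the standard library. `measure_throughput` is a hypothetical name, and `str.lower` stands in for whichever normalizer you want to benchmark.

```python
import time
from typing import Callable, Sequence


def measure_throughput(fn: Callable[[str], str], texts: Sequence[str],
                       repeats: int = 5) -> float:
    """Return texts processed per second, using the fastest of several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            fn(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(texts) / max(best, 1e-12)


sample = ["nk tnya brp hrga"] * 10_000
rate = measure_throughput(str.lower, sample)  # swap in the real normalizer
print(f"{rate:,.0f} texts/sec")
```

Taking the fastest run reduces noise from other processes on the machine.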

See also