Named Entity Recognition
Extract persons, organisations, locations, and Malaysian-specific entities from text.
Why NER?
NER supports information extraction, knowledge graph construction, content tagging, and search indexing for Malaysian text. Standard NER models trained on news corpora often miss Malaysian-specific entities such as "Petronas", "UMNO", and "nasi lemak", as well as short or informal place names such as "KL" and "Penang".
malaysian-manglish-nlp recognises entities in Manglish, social media text, and informal writing.
Load module
import malaysian_manglish_nlp as mnlp
entities = mnlp.ner_tag("Ahmad kerja kat Petronas Tower KL")
print(entities)
# [('Ahmad', 'PERSON'), ('Petronas Tower', 'ORG'), ('KL', 'LOCATION')]
Basic usage
Simple extraction
mnlp.ner_tag("Ali pergi kedai Mamak dekat Bukit Bintang")
# [('Ali', 'PERSON'), ('Mamak', 'ORG'), ('Bukit Bintang', 'LOCATION')]
mnlp.ner_tag("Harga minyak RON95 naik RM0.20 minggu ni")
# [('RON95', 'PRODUCT'), ('RM0.20', 'MONEY')]
mnlp.ner_tag("PM Ismail Sabri melawat Sabah hari Isnin")
# [('Ismail Sabri', 'PERSON'), ('Sabah', 'LOCATION'), ('Isnin', 'DATE')]
Raw tuple output
Each entity is a (text, label) tuple:
entities = mnlp.ner_tag("Dr. Siti bekerja di Hospital Kuala Lumpur")
for text, label in entities:
    print(f"{text:25s} → {label}")
# Dr. Siti                  → PERSON
# Hospital Kuala Lumpur     → ORG
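Because the result is a plain list of (text, label) tuples, it can be regrouped with standard Python. The following is a minimal sketch that uses the output shown above as a literal list instead of calling the library; `group_by_label` is a hypothetical helper, not part of malaysian-manglish-nlp.

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (text, label) tuples into {label: [text, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Sample data copied from the example output above.
entities = [("Dr. Siti", "PERSON"), ("Hospital Kuala Lumpur", "ORG")]
print(group_by_label(entities))
```

Returning a regular dict keeps the result easy to serialise or pass to other code.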
Entity types
malaysian-manglish-nlp recognises these entity categories:
| Type | Description | Examples |
|---|---|---|
| PERSON | Names of people | Ahmad, Siti Nurhaliza, Dr. Mahathir |
| ORG | Organisations | Petronas, UMNO, Maybank, UMP |
| LOCATION | Places, addresses | KL, Penang, Bukit Bintang, Sabah |
| DATE | Dates, temporal | Isnin, 2026, esok, minggu depan |
| TIME | Time expressions | pukul 3, 5:30 petang, tengah malam |
| MONEY | Currency amounts | RM50, RM1,200, 50 sen |
| PRODUCT | Products, brands | RON95, iPhone, Myvi, nasi lemak |
| EVENT | Events | Hari Raya, Merdeka, PRU15 |
| QUANTITY | Measurements | 5 kg, 100 km, 3 liter |
Examples for each type
# PERSON
mnlp.ner_tag("Siti Nurhaliza nyanyi kat KLCC")
# [('Siti Nurhaliza', 'PERSON'), ('KLCC', 'LOCATION')]
# ORG
mnlp.ner_tag("Student UMP menang competition anjuran Google")
# [('UMP', 'ORG'), ('Google', 'ORG')]
# LOCATION
mnlp.ner_tag("Jom lepak mamak dekat SS15 Subang")
# [('SS15', 'LOCATION'), ('Subang', 'LOCATION')]
# DATE
mnlp.ner_tag("Deadline submission 15 Jun 2026")
# [('15 Jun 2026', 'DATE')]
# MONEY
mnlp.ner_tag("Nasi lemak RM1.50, teh tarik RM2")
# [('RM1.50', 'MONEY'), ('RM2', 'MONEY')]
# PRODUCT
mnlp.ner_tag("Myvi baru launch harga RM50k")
# [('Myvi', 'PRODUCT'), ('RM50k', 'MONEY')]
# EVENT
mnlp.ner_tag("Hari Raya tahun ni balik kampung Johor")
# [('Hari Raya', 'EVENT'), ('Johor', 'LOCATION')]
# QUANTITY
mnlp.ner_tag("Beli 2 kg ayam dengan 5 liter minyak")
# [('2 kg', 'QUANTITY'), ('5 liter', 'QUANTITY')]
Handling Manglish text
NER works on informal text without preprocessing:
# Informal Manglish
mnlp.ner_tag("weh jom makan kat mamak ali dekat bangsar")
# [('ali', 'PERSON'), ('bangsar', 'LOCATION')]
# With abbreviations
mnlp.ner_tag("Kerja kat TTDI, rumah dekat PJ")
# [('TTDI', 'LOCATION'), ('PJ', 'LOCATION')]
# Mixed language
mnlp.ner_tag("Meeting with Dato' Sri at Hilton KL tomorrow")
# [("Dato' Sri", 'PERSON'), ('Hilton KL', 'ORG'), ('tomorrow', 'DATE')]
Case sensitivity
NER works with both cased and uncased text. Capitalised input improves accuracy, but lowercase Manglish still produces good results.
Custom entities
For domain-specific entities, combine NER with keyword extraction:
# Extract entities + keywords for full coverage
text = "Server AWS down since 3am, affected all users in Malaysia region"
entities = mnlp.ner_tag(text)
keywords = mnlp.extract_keywords(text)
print("Entities:", entities)
print("Keywords:", keywords[:5])
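One way to combine the two results is to keep every NER entity and append any keyword that NER did not already cover, under a catch-all label. The sketch below rests on assumptions: it treats the keyword result as a list of plain strings (this page does not state the return type of `extract_keywords`), the `KEYWORD` label and the `merge_entities_and_keywords` helper are hypothetical, and the inputs are illustrative rather than real library output.

```python
def merge_entities_and_keywords(entities, keywords, keyword_label="KEYWORD"):
    """Return NER entities plus keywords that no entity text already contains."""
    merged = list(entities)
    covered = [text.lower() for text, _ in entities]
    for kw in keywords:
        kw_lower = kw.lower()
        # Skip keywords that equal, or appear inside, an entity NER already found.
        if any(kw_lower in ent for ent in covered):
            continue
        merged.append((kw, keyword_label))
    return merged

# Illustrative inputs only (assumed shapes, not real library output).
entities = [("AWS", "ORG"), ("3am", "TIME"), ("Malaysia", "LOCATION")]
keywords = ["server", "aws", "users", "region"]
print(merge_entities_and_keywords(entities, keywords))
```

The substring check is deliberately simple; a stricter version could compare whole tokens instead.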
Batch processing
Process multiple texts by calling mnlp.ner_tag on each one in a loop:
texts = [
    "Ali pergi kedai Mamak KL",
    "Siti beli Myvi baru RM45k",
    "Meeting UMP pukul 3 petang",
]
for text in texts:
    entities = mnlp.ner_tag(text)
    print(f"{text:35s} → {entities}")
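For larger batches it is often more useful to aggregate results than to print them per text. The sketch below counts how often each (text, label) pair appears across a batch. The tagger is passed in as an argument, so any callable that returns (text, label) tuples should work, such as `mnlp.ner_tag`; `count_entities` and `fake_tagger` are hypothetical names, and `fake_tagger` with its `ABBR` label is only a stand-in so the example runs without the library.

```python
from collections import Counter

def count_entities(texts, tagger):
    """Count (text, label) pairs across a batch of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tagger(text))
    return counts

def fake_tagger(text):
    # Stand-in for mnlp.ner_tag: tags every fully upper-case token
    # with a placeholder label.
    return [(tok, "ABBR") for tok in text.split() if tok.isupper()]

texts = ["Ali pergi kedai Mamak KL", "Meeting UMP pukul 3 petang", "Student UMP menang"]
counts = count_entities(texts, fake_tagger)
print(counts.most_common(2))
```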
CLI usage
# Basic NER
$ mnlp ner "Ahmad kerja kat Petronas KL"
Ahmad → PERSON
Petronas → ORG
KL → LOCATION
# JSON output
$ mnlp ner "Siti beli Myvi RM50k" --json
[["Siti", "PERSON"], ["Myvi", "PRODUCT"], ["RM50k", "MONEY"]]
# Pipe input
$ echo "Meeting UMP pukul 3" | mnlp ner
UMP → ORG
pukul 3 → TIME
# Full analysis (includes NER)
$ mnlp analyze "Ali pergi Petronas beli RON95 RM50"
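The `--json` output shown above is a JSON array of [text, label] pairs, so a script can consume it directly. This is a minimal sketch that parses the sample line from the example instead of invoking the CLI:

```python
import json

# Sample line copied from the --json example above.
raw = '[["Siti", "PERSON"], ["Myvi", "PRODUCT"], ["RM50k", "MONEY"]]'

# JSON has no tuple type, so convert each [text, label] pair back to a tuple.
entities = [tuple(pair) for pair in json.loads(raw)]

# Keep only the currency amounts.
money = [text for text, label in entities if label == "MONEY"]
print(entities)
print(money)
```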
How it works
- Tokenization - text split into tokens with Malaysian-aware rules
- Pattern matching - gazetteers for Malaysian names, places, organisations
- Context features - surrounding words, capitalisation, position
- Rule engine - regex patterns for MONEY, DATE, TIME, QUANTITY
- Post-processing - merge adjacent entities, resolve conflicts
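To make the rule-engine step concrete, here is a small regex-based tagger for MONEY, QUANTITY, and TIME. It is a sketch only: the patterns and the `rule_tag` function are assumptions written for illustration, and the library's actual rules are not shown on this page.

```python
import re

# Hypothetical patterns for illustration; not the library's real rules.
RULES = [
    ("MONEY", re.compile(r"\bRM\s?\d[\d,]*(?:\.\d+)?k?\b|\b\d+\s?sen\b", re.IGNORECASE)),
    ("QUANTITY", re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|liter)\b", re.IGNORECASE)),
    ("TIME", re.compile(r"\bpukul\s\d{1,2}(?:[:.]\d{2})?\b", re.IGNORECASE)),
]

def rule_tag(text):
    """Return (text, label) tuples for every rule match, ordered by position."""
    hits = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), label))
    hits.sort()  # sort by start offset so output follows the input text
    return [(span, label) for _, span, label in hits]

print(rule_tag("Beli 2 kg ayam RM1.50, meeting pukul 3"))
```

Regular expressions suit these types because their surface forms are highly regular, whereas names and places depend on gazetteers and context.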
Performance
| Metric | Score |
|---|---|
| Overall F1 | 84.3% |
| PERSON F1 | 89.1% |
| ORG F1 | 82.7% |
| LOCATION F1 | 87.5% |
| MONEY F1 | 95.2% |
| Throughput | 18,000 texts/sec |
| Latency (single) | < 1ms |
Benchmarked on 3,000 annotated Malaysian texts. See Benchmarks for full details.
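F1 is the harmonic mean of precision and recall. This page does not state which matching scheme the benchmark uses, so the sketch below assumes exact matching on (text, label) pairs. `entity_f1` is a hypothetical helper and the annotations are made up for illustration.

```python
def entity_f1(gold, predicted):
    """Precision, recall and F1 over exact-match (text, label) pairs."""
    # Sets ignore duplicate mentions of the same pair.
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up annotations for illustration; not taken from the benchmark set.
gold = [("Ahmad", "PERSON"), ("Petronas", "ORG"), ("KL", "LOCATION")]
predicted = [("Ahmad", "PERSON"), ("Petronas", "LOCATION"), ("KL", "LOCATION")]
p, r, f1 = entity_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under exact matching, a correct span with the wrong label counts as both a false positive and a false negative.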
See also
- Sentiment Analysis - combine with NER for aspect-based sentiment
- Pipeline - run NER as part of a multi-step pipeline
- REST API - serve NER over HTTP
- API Reference - full function signature