Extraction¶

Extract structured data from Malaysian text: entities, grammar, keywords, and coreferences.

Overview¶

Extraction modules convert unstructured text into structured representations: named entities, part-of-speech tags, dependency trees, keyword lists, and coreference chains. All modules handle Manglish and code-switched input natively.

import malaysian_manglish_nlp as mnlp

Quick Start¶

import malaysian_manglish_nlp as mnlp

text = "Siti beli iPhone 15 kat Low Yat Plaza semalam, harga RM4,200"

mnlp.ner(text)
# [('Siti', 'PERSON'), ('iPhone 15', 'PRODUCT'),
#  ('Low Yat Plaza', 'LOCATION'), ('semalam', 'DATE'), ('RM4,200', 'MONEY')]

mnlp.pos(text)
# [('Siti', 'PROPN'), ('beli', 'VERB'), ('iPhone', 'PROPN'), ('15', 'NUM'),
#  ('kat', 'ADP'), ('Low', 'PROPN'), ('Yat', 'PROPN'), ('Plaza', 'PROPN'),
#  ('semalam', 'NOUN'), (',', 'PUNCT'), ('harga', 'NOUN'), ('RM4,200', 'NUM')]

mnlp.keywords(text, top_k=3)
# ['iPhone 15', 'Low Yat Plaza', 'RM4,200']
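
Downstream code often needs entities grouped by type. The following is a minimal sketch, assuming `ner` returns the `(text, label)` tuples shown above. `group_by_label` is a hypothetical helper, not part of the library, and the sample list is copied from the output above rather than produced by a live call.

```python
from collections import defaultdict

# Sample copied from the Quick Start output above (not a live call).
entities = [
    ("Siti", "PERSON"), ("iPhone 15", "PRODUCT"),
    ("Low Yat Plaza", "LOCATION"), ("semalam", "DATE"), ("RM4,200", "MONEY"),
]

def group_by_label(pairs):
    """Hypothetical helper: map each label to the entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label["LOCATION"])
```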

Module Details¶

`ner`¶

Named entity recognition across 7 entity types, using a model trained on Malaysian text. Handles Malay, Chinese, Indian, and mixed names, including honorifics (Dato', Tan Sri, Dr.).

import malaysian_manglish_nlp as mnlp

mnlp.ner("Dato' Sri Ismail Sabri umum bantuan RM500 di Putrajaya")
# [("Dato' Sri Ismail Sabri", 'PERSON'), ('RM500', 'MONEY'),
#  ('Putrajaya', 'LOCATION')]

Entity Types¶

Entity	Description	Examples
`PERSON`	Person names with titles	Siti Nurhaliza, Dr. Mahathir, Dato' Seri Anwar
`LOCATION`	Places, cities, addresses	KL, Bukit Bintang, Pahang, Low Yat Plaza
`ORG`	Organisations	Petronas, UMP, Grab, Shopee
`PRODUCT`	Products, brands, models	iPhone 15, Myvi, Milo, Samsung Galaxy
`EVENT`	Events, holidays	Hari Raya, Merdeka Day, PRU15
`DATE`	Dates, times, durations	semalam, 15 Mei 2024, next week, 2 jam
`MONEY`	Monetary values	RM50, 3 ringgit, USD100

Parameters¶

Parameter	Type	Default	Description
`text`	`str | list[str]`	required	Input text or list of texts
`spans`	`bool`	`False`	Return character offsets with each entity
`types`	`list[str]`	`None`	Filter to specific entity types only
`threshold`	`float`	`0.5`	Minimum confidence to include entity

Span Positions

mnlp.ner("Ahmad kerja kat Petronas", spans=True)
# [{'text': 'Ahmad', 'label': 'PERSON', 'start': 0, 'end': 5},
#  {'text': 'Petronas', 'label': 'ORG', 'start': 16, 'end': 24}]
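
Character offsets make it possible to mark entities up in the original string. The sketch below assumes zero-based offsets with an exclusive `end`, so that `text[start:end]` reproduces the entity text. Both `make_span` and `annotate` are hypothetical helpers, not library functions, and the spans are rebuilt with `str.index` instead of calling the model.

```python
text = "Ahmad kerja kat Petronas"

def make_span(text, mention, label):
    """Hypothetical helper: build a span dict in the shape shown above."""
    start = text.index(mention)  # first occurrence only
    return {"text": mention, "label": label,
            "start": start, "end": start + len(mention)}

def annotate(text, spans):
    """Wrap each entity as [text](LABEL).

    Edits run right to left so earlier offsets stay valid.
    """
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        start, end = span["start"], span["end"]
        text = f"{text[:start]}[{text[start:end]}]({span['label']}){text[end:]}"
    return text

spans = [make_span(text, "Ahmad", "PERSON"), make_span(text, "Petronas", "ORG")]
annotated = annotate(text, spans)
print(annotated)
```

Applying the edits from the last span backwards avoids recomputing offsets after each insertion.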

Malaysian Names

The NER model recognises Malay patronymic patterns (bin/binti), Chinese multi-character names, and Indian names. Honorifics like Tan Sri, Dato', and Tun are grouped with the name.

`pos`¶

Part-of-speech tagging using the Universal Dependencies (UD) tag set, adapted for Malay grammar. Handles particles, code-switched words, and informal auxiliaries (nak, dah, boleh).

import malaysian_manglish_nlp as mnlp

mnlp.pos("Aku nak pergi makan kat kedai tu")
# [('Aku', 'PRON'), ('nak', 'AUX'), ('pergi', 'VERB'), ('makan', 'VERB'),
#  ('kat', 'ADP'), ('kedai', 'NOUN'), ('tu', 'DET')]
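
A common follow-up is to keep only content words or to count tags. This is a small sketch over the `(token, tag)` tuples shown above. The sample list is copied from that output, and `CONTENT_TAGS` is an illustrative choice, not a library constant.

```python
from collections import Counter

# Sample copied from the output above (not a live call).
tagged = [
    ("Aku", "PRON"), ("nak", "AUX"), ("pergi", "VERB"), ("makan", "VERB"),
    ("kat", "ADP"), ("kedai", "NOUN"), ("tu", "DET"),
]

# Illustrative set of open-class tags; adjust to the task at hand.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

content_words = [token for token, tag in tagged if tag in CONTENT_TAGS]
tag_counts = Counter(tag for _, tag in tagged)

print(content_words)
print(tag_counts.most_common(1))
```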

UD Tag Set¶

Tag	Description	Malay Examples
`NOUN`	Common noun	rumah, kereta, makanan
`PROPN`	Proper noun	Siti, iPhone, Low Yat
`VERB`	Action/state verb	makan, pergi, tidur
`ADJ`	Adjective	cantik, besar, sedap
`ADV`	Adverb	sangat, betul, cepat
`PRON`	Pronoun	aku, dia, mereka
`DET`	Determiner	tu, ini, semua
`ADP`	Preposition	kat, dari, untuk
`AUX`	Auxiliary	nak, dah, boleh, akan
`CONJ`	Conjunction	dan, tapi, atau
`PART`	Particle	la, je, kot, kan
`NUM`	Number	satu, 100, RM50
`PUNCT`	Punctuation	., !, ?
`INTJ`	Interjection	weh, eh, alamak

Parameters¶

Parameter	Type	Default	Description
`text`	`str`	required	Input text
`detailed`	`bool`	`False`	Include morphological features (e.g. person, number, aspect, voice)
`format`	`str`	`"list"`	Output format: `"list"`, `"dict"`

Detailed Morphological Features

mnlp.pos("Mereka sedang bermain", detailed=True)
# [('Mereka', 'PRON', {'person': 3, 'number': 'plur'}),
#  ('sedang', 'AUX', {'aspect': 'prog'}),
#  ('bermain', 'VERB', {'voice': 'act', 'root': 'main'})]

`dependency`¶

Dependency parsing for Malay and Manglish sentences. Returns syntactic relations between words in Universal Dependencies format.

import malaysian_manglish_nlp as mnlp

mnlp.dependency("Ali bagi buku tu kat Siti semalam")
# [('Ali', 'nsubj', 'bagi'),
#  ('bagi', 'ROOT', 'ROOT'),
#  ('buku', 'obj', 'bagi'),
#  ('tu', 'det', 'buku'),
#  ('kat', 'case', 'Siti'),
#  ('Siti', 'obl', 'bagi'),
#  ('semalam', 'obl:tmod', 'bagi')]
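
The `(word, relation, head)` triples can be queried directly. This sketch pulls out the root verb with its subject and object, assuming the list format shown above. `core_arguments` is a hypothetical helper. It matches heads by word form, so it would need head indices to cope with sentences that repeat a word.

```python
# Sample copied from the output above (not a live call).
triples = [
    ("Ali", "nsubj", "bagi"), ("bagi", "ROOT", "ROOT"),
    ("buku", "obj", "bagi"), ("tu", "det", "buku"),
    ("kat", "case", "Siti"), ("Siti", "obl", "bagi"),
    ("semalam", "obl:tmod", "bagi"),
]

def core_arguments(triples):
    """Hypothetical helper: return (root, subject, object) for one sentence."""
    root = next(word for word, rel, _ in triples if rel == "ROOT")
    # Keep only direct dependents of the root that are core arguments.
    args = {rel: word for word, rel, head in triples
            if head == root and rel in {"nsubj", "obj"}}
    return root, args.get("nsubj"), args.get("obj")

print(core_arguments(triples))
```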

Parameters¶

Parameter	Type	Default	Description
`text`	`str`	required	Input sentence
`format`	`str`	`"list"`	Output format: `"list"`, `"tree"`, `"conllu"`
`detailed`	`bool`	`False`	Include POS tags and head indices
`visualize`	`bool`	`False`	Generate dependency tree image (requires matplotlib)

Tree Visualisation

tree = mnlp.dependency("Ali bagi buku tu kat Siti", format="tree")
print(tree)
# bagi (ROOT)
# ├── Ali (nsubj)
# ├── buku (obj)
# │   └── tu (det)
# └── Siti (obl)
#     └── kat (case)
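
To see how the tree layout relates to the list format, here is a sketch that renders `(word, relation, head)` triples as an indented tree. `render_tree` is a hypothetical helper and not how the library builds its `"tree"` output. It assumes a single root and unique word forms.

```python
from collections import defaultdict

# Triples in the list format for the same sentence.
triples = [
    ("Ali", "nsubj", "bagi"), ("bagi", "ROOT", "ROOT"),
    ("buku", "obj", "bagi"), ("tu", "det", "buku"),
    ("kat", "case", "Siti"), ("Siti", "obl", "bagi"),
]

def render_tree(triples):
    """Hypothetical helper: draw the dependency tree with box characters."""
    children = defaultdict(list)  # head -> dependents, in sentence order
    relation = {}                 # word -> its relation label
    root = None
    for word, rel, head in triples:
        relation[word] = rel
        if rel == "ROOT":
            root = word
        else:
            children[head].append(word)

    lines = [f"{root} ({relation[root]})"]

    def walk(node, prefix):
        kids = children[node]
        for i, kid in enumerate(kids):
            last = i == len(kids) - 1
            branch = "└── " if last else "├── "
            lines.append(f"{prefix}{branch}{kid} ({relation[kid]})")
            # Extend the prefix: keep a vertical bar while siblings remain.
            walk(kid, prefix + ("    " if last else "│   "))

    walk(root, "")
    return "\n".join(lines)

print(render_tree(triples))
```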

Relation Labels

Uses Universal Dependencies v2 relation labels. Common ones: nsubj, obj, obl, det, amod, advmod, case. Full reference: UD docs.

`coreference`¶

Resolve pronouns and repeated mentions across sentences. Handles gender-neutral Malay pronouns (dia, ia, mereka).

import malaysian_manglish_nlp as mnlp

text = "Ahmad jumpa Siti kat mall. Dia cakap dia nak balik awal."
mnlp.coreference(text)
# {'clusters': [
#   [('Ahmad', 0, 5), ('Dia', 27, 30)],
#   [('Siti', 12, 16), ('dia', 37, 40)]
# ]}

Parameters¶

Parameter	Type	Default	Description
`text`	`str`	required	Multi-sentence input
`resolve`	`bool`	`False`	Replace pronouns with resolved entities
`chains`	`bool`	`False`	Return mention chains instead of clusters

Pronoun Resolution

mnlp.coreference(text, resolve=True)
# "Ahmad jumpa Siti kat mall. Ahmad cakap Siti nak balik awal."
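
Conceptually, resolution substitutes each later mention with the first mention in its cluster. The sketch below illustrates that idea on cluster data in the `(mention, start, end)` shape shown earlier, with zero-based offsets and an exclusive end. `resolve_mentions` is a hypothetical helper. The library's own implementation is not documented here and may differ.

```python
text = "Ahmad jumpa Siti kat mall. Dia cakap dia nak balik awal."

# Each cluster lists mentions as (mention, start, end); text[start:end] == mention.
clusters = [
    [("Ahmad", 0, 5), ("Dia", 27, 30)],
    [("Siti", 12, 16), ("dia", 37, 40)],
]

def resolve_mentions(text, clusters):
    """Hypothetical helper: replace later mentions with the cluster's first one."""
    replacements = []
    for cluster in clusters:
        antecedent = cluster[0][0]
        for _, start, end in cluster[1:]:
            replacements.append((start, end, antecedent))
    # Apply from the end of the string so earlier offsets stay valid.
    for start, end, antecedent in sorted(replacements, reverse=True):
        text = text[:start] + antecedent + text[end:]
    return text

resolved = resolve_mentions(text, clusters)
print(resolved)
```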

Ambiguous Pronouns

The Malay pronoun dia is gender-neutral, so the model relies on contextual cues (recency, topic, verb semantics) to resolve it. For highly ambiguous texts, chains=True returns all candidate mentions.

`keywords`¶

Extract keywords and key phrases using multiple algorithms. Ships with a Malaysian-specific stopword list that includes particles (la, je, kot) and common code-switch fillers.

import malaysian_manglish_nlp as mnlp

article = """Kerajaan Malaysia umumkan pakej rangsangan ekonomi bernilai
RM50 bilion untuk membantu rakyat dan perniagaan kecil yang terjejas."""

mnlp.keywords(article, top_k=3, scores=True)
# [('pakej rangsangan ekonomi', 0.92),
#  ('RM50 bilion', 0.87),
#  ('perniagaan kecil', 0.79)]

Parameters¶

Parameter	Type	Default	Description
`text`	`str`	required	Input text
`top_k`	`int`	`5`	Number of keywords to return
`scores`	`bool`	`False`	Include relevance scores
`method`	`str`	`"textrank"`	Algorithm: `"textrank"`, `"tfidf"`, `"yake"`
`ngram_range`	`tuple`	`(1, 3)`	Min/max n-gram size for key phrases

Method Comparison¶

Method	Speed	Best For
`textrank`	Medium	General-purpose, graph-based ranking
`tfidf`	Fast	Large corpora, statistical weighting
`yake`	Slow	Single documents, unsupervised extraction

N-gram Range

Use ngram_range=(2, 4) for longer key phrases; (1, 1) for single-word keywords only.
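
To make the effect of the range concrete, this sketch enumerates the candidate phrases a `(min, max)` n-gram range allows over a token list. `ngram_candidates` is a hypothetical helper for illustration only. How the library generates and filters its candidates is not specified here.

```python
def ngram_candidates(tokens, ngram_range=(1, 3)):
    """Hypothetical helper: list every contiguous n-gram within the size range."""
    smallest, largest = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(smallest, largest + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "pakej rangsangan ekonomi".split()

single_words = ngram_candidates(tokens, ngram_range=(1, 1))
longer_phrases = ngram_candidates(tokens, ngram_range=(2, 4))

print(single_words)
print(longer_phrases)
```

With only three tokens, the `(2, 4)` range yields no 4-grams, so the longest candidate is the full three-word phrase.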
