Advanced NLP
Complex linguistic analysis: code-switching, intent, topic, hate speech, stance, and discourse structure.
Overview
The advanced modules handle higher-order linguistic phenomena characteristic of Malaysian multilingual text. They require the `[ml]` extra and are designed for production chatbots, content moderation systems, and research applications.
Quick Start
```python
import malaysian_manglish_nlp as mnlp

# Code-switching detection
mnlp.code_switching("I think kita should go makan first, then baru discuss")
# {'switches': 4, 'pattern': 'intra-sentential',
#  'segments': [('I think', 'en'), ('kita', 'ms'), ('should go', 'en'),
#               ('makan', 'ms'), ('first, then', 'en'), ('baru', 'ms'), ('discuss', 'en')]}

# Intent classification
mnlp.intent("Nak tanya, kedai tu bukak pukul berapa eh?")
# {'intent': 'question_info', 'confidence': 0.91,
#  'slots': {'entity': 'kedai', 'attribute': 'operating_hours'}}

# Hate speech moderation
mnlp.hate_speech("Semua kaum X memang sampah masyarakat")
# {'is_hate': True, 'target': 'race', 'severity': 'high', 'confidence': 0.94}
```
Module Details
code_switching
Detect and analyse code-switching patterns between languages. Identifies switch points, matrix language, and switching type.
```python
import malaysian_manglish_nlp as mnlp

text = "I think kita should go makan first, then baru discuss"
mnlp.code_switching(text)
# {'switches': 4, 'pattern': 'intra-sentential',
#  'segments': [('I think', 'en'), ('kita', 'ms'), ('should go', 'en'),
#               ('makan', 'ms'), ('first, then', 'en'), ('baru', 'ms'), ('discuss', 'en')]}
```
Switching Types

| Type | Description | Example |
|---|---|---|
| `inter-sentential` | Switch between sentences | "Best movie. Tapi ending hampeh." |
| `intra-sentential` | Switch within a sentence | "I rasa macam nak pergi" |
| `tag-switching` | Insert particles/tags | "Good la, very nice right?" |
| `intra-word` | Morpheme mixing | "download-kan", "upload-lah" |
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Input text |
| `points` | `bool` | `False` | Return detailed switch point analysis |
| `matrix` | `bool` | `False` | Identify matrix vs embedded language |
| `classify` | `bool` | `False` | Classify switching type |
Switch Point Analysis
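As a rough illustration of what a switch point is, the sketch below derives the boundaries directly from a `segments` list of the shape shown in the example output above. The helper `switch_points` and its `token_index`, `from`, and `to` keys are hypothetical names chosen for this sketch; they are not the structure returned by `points=True`, and the library is not imported.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (text, language code), as in the 'segments' output above


def switch_points(segments: List[Segment]) -> List[dict]:
    """List every boundary where the language changes between adjacent segments.

    Hypothetical helper for illustration; not part of malaysian_manglish_nlp.
    """
    points = []
    token_index = 0  # whitespace-token offset of the segment that follows the boundary
    for (prev_text, prev_lang), (_, next_lang) in zip(segments, segments[1:]):
        token_index += len(prev_text.split())
        if prev_lang != next_lang:
            points.append({"token_index": token_index, "from": prev_lang, "to": next_lang})
    return points


# Segments copied from the documented example output.
segments = [("I think", "en"), ("kita", "ms"), ("should go", "en"),
            ("makan", "ms"), ("first, then", "en"), ("baru", "ms"), ("discuss", "en")]

for point in switch_points(segments):
    print(point)
```

This counts every language boundary, which gives six for the sample, whereas the documented output reports `'switches': 4`. The library presumably counts switches by a different rule, so treat this only as a way to inspect the segments.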
Matrix Language
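The matrix language is the one that supplies the grammatical frame of an utterance, with the embedded language contributing inserted words or phrases. A crude way to approximate it from a `segments` list is to pick the language with the most tokens. The sketch below does only that; `dominant_language` is a hypothetical helper, and a token count is a stand-in heuristic rather than whatever analysis `matrix=True` performs.

```python
from collections import Counter
from typing import List, Tuple


def dominant_language(segments: List[Tuple[str, str]]) -> Tuple[str, float]:
    """Return the language with the most whitespace tokens and its share of all tokens.

    Token-count heuristic for illustration only; not the library's method.
    """
    counts = Counter()
    for text, lang in segments:
        counts[lang] += len(text.split())
    if not counts:
        raise ValueError("segments must contain at least one token")
    lang, n_tokens = counts.most_common(1)[0]
    return lang, n_tokens / sum(counts.values())


# Segments copied from the documented example output.
segments = [("I think", "en"), ("kita", "ms"), ("should go", "en"),
            ("makan", "ms"), ("first, then", "en"), ("baru", "ms"), ("discuss", "en")]

print(dominant_language(segments))  # ('en', 0.7)
```

A count like this can be misleading for short utterances where the grammatical frame is Malay but most content words are English, which is why a frame-based analysis is preferable when available.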
intent
Classify user intent for chatbots and dialogue systems. Returns intent label, confidence, and extracted slots.
```python
import malaysian_manglish_nlp as mnlp

mnlp.intent("Nak order 2 nasi lemak extra sambal")
# {'intent': 'request_action', 'confidence': 0.89,
#  'slots': {'item': 'nasi lemak', 'quantity': 2, 'modifier': 'extra sambal'}}
```
Intent Categories

| Intent | Description | Example |
|---|---|---|
| `question_info` | Asking for information | "Berapa harga tu?" |
| `request_action` | Requesting an action | "Tolong bukakkan pintu" |
| `complaint` | Expressing dissatisfaction | "Service teruk la kat sini" |
| `greeting` | Opening a conversation | "Assalamualaikum, apa khabar?" |
| `farewell` | Closing a conversation | "Ok la, jumpa nanti" |
| `confirmation` | Agreeing / confirming | "Ok boleh, set" |
| `negation` | Declining / rejecting | "Taknak la, mahal sangat" |
| `opinion` | Expressing a view | "Aku rasa best gila movie tu" |
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Input utterance |
| `multi` | `bool` | `False` | Detect multiple intents |
| `labels` | `list[str]` | `None` | Custom intent labels (override defaults) |
| `slots` | `bool` | `True` | Extract slot values |
Custom Labels for Domain Bots
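Once a domain bot defines its own label set through the `labels` parameter, the returned intent still has to be routed to application code. The sketch below shows one possible dispatcher with a confidence fallback. Everything in it is an assumption made for illustration: the label names, the handlers, the 0.7 threshold, and the sample result, which is hand-written in the documented result shape rather than produced by the library.

```python
from typing import Callable, Dict

# Hypothetical label set for a food-ordering bot. With the library installed it
# would be passed as: mnlp.intent(text, labels=LABELS)
LABELS = ["order_food", "track_order", "cancel_order"]


def handle_order(slots: dict) -> str:
    return f"Ordering {slots.get('quantity', 1)} x {slots.get('item', 'item')}"


def handle_track(slots: dict) -> str:
    return "Checking your order status"


def handle_cancel(slots: dict) -> str:
    return "Cancelling your order"


HANDLERS: Dict[str, Callable[[dict], str]] = {
    "order_food": handle_order,
    "track_order": handle_track,
    "cancel_order": handle_cancel,
}

FALLBACK = "Maaf, boleh ulang semula?"


def route(result: dict, threshold: float = 0.7) -> str:
    """Dispatch an intent result to its handler, falling back when unsure."""
    handler = HANDLERS.get(result.get("intent"))
    if handler is None or result.get("confidence", 0.0) < threshold:
        return FALLBACK
    return handler(result.get("slots", {}))


# Hand-written sample in the documented result shape (not real library output).
sample = {"intent": "order_food", "confidence": 0.89,
          "slots": {"item": "nasi lemak", "quantity": 2, "modifier": "extra sambal"}}
print(route(sample))  # Ordering 2 x nasi lemak
```

Keeping the fallback in the router rather than in each handler means a low-confidence or unknown label never reaches business logic.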
topic
Topic classification and unsupervised topic modelling for Malaysian text.
```python
import malaysian_manglish_nlp as mnlp

mnlp.topic("Harga minyak naik lagi, memang susah rakyat nak survive")
# {'topic': 'economy', 'subtopic': 'cost_of_living', 'confidence': 0.87}
```
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str \| list[str]` | required | Input text or corpus |
| `top_k` | `int` | `1` | Number of topics to return |
| `labels` | `list[str]` | `None` | Custom topic labels |
| `mode` | `str` | `"classify"` | `"classify"` (single text) or `"model"` (corpus clustering) |
| `n_topics` | `int` | `10` | Number of topics for unsupervised modelling |
Multi-Topic Classification
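When more than one topic is requested with `top_k`, the caller usually wants to keep only the topics that clear some confidence bar. The sketch below assumes, without confirmation from this page, that the result for `top_k > 1` is a list of dicts with `topic` and `confidence` keys like the single-topic result. The helper `confident_topics`, the 0.3 threshold, and the sample values are all placeholders.

```python
from typing import List


def confident_topics(results: List[dict], min_confidence: float = 0.3) -> List[str]:
    """Return topic labels at or above the threshold, most confident first."""
    ranked = sorted(results, key=lambda r: r["confidence"], reverse=True)
    return [r["topic"] for r in ranked if r["confidence"] >= min_confidence]


# Placeholder values in an assumed result shape (not real library output).
results = [
    {"topic": "politics", "confidence": 0.41},
    {"topic": "economy", "confidence": 0.87},
    {"topic": "sports", "confidence": 0.05},
]
print(confident_topics(results))  # ['economy', 'politics']
```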
Unsupervised Topic Modelling
hate_speech
Detect hate speech and offensive content targeting Malaysian communities. Understands local slurs, coded language, and dog whistles specific to the Malaysian context.
```python
import malaysian_manglish_nlp as mnlp

mnlp.hate_speech("Semua bangsa X memang macam tu, tak boleh dipercayai")
# {'is_hate': True, 'target': 'race', 'severity': 'high', 'confidence': 0.92}
```
Severity Levels

| Level | Description | Example |
|---|---|---|
| `low` | Offensive but not dehumanising | Casual slurs among peers |
| `medium` | Stereotyping, generalisation | "Semua orang X memang pemalas" |
| `high` | Dehumanising, inciting hatred | Calls for exclusion or violence |
Target Categories
`race`, `religion`, `gender`, `nationality`, `disability`, `sexual_orientation`
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Input text |
| `detailed` | `bool` | `False` | Include type classification and entity spans |
| `granular` | `bool` | `False` | Distinguish hate vs offensive vs neither |
| `moderate` | `bool` | `False` | Return moderation action recommendation |
Content Moderation Mode
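A moderation pipeline typically combines the severity level with the classifier's confidence before acting. The sketch below shows one application-side policy built on the documented result keys (`is_hate`, `severity`, `confidence`). The action names, the 0.8 review threshold, and the helper `moderation_action` are assumptions for illustration; they are not the recommendation format returned by `moderate=True`.

```python
# Hypothetical mapping from documented severity levels to application actions.
ACTION_BY_SEVERITY = {"low": "flag", "medium": "hide", "high": "remove"}


def moderation_action(result: dict, review_below: float = 0.8) -> str:
    """Pick an action from a hate_speech-style result dict.

    Low-confidence or unrecognised results are sent to a human reviewer
    instead of being acted on automatically.
    """
    if not result.get("is_hate"):
        return "allow"
    if result.get("confidence", 0.0) < review_below:
        return "human_review"
    return ACTION_BY_SEVERITY.get(result.get("severity"), "human_review")


# Result copied from the documented example output.
result = {"is_hate": True, "target": "race", "severity": "high", "confidence": 0.92}
print(moderation_action(result))  # remove
```

Routing uncertain cases to review rather than removing them automatically limits the cost of false positives, which matters for reclaimed or quoted language.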
Sensitive Content
This module processes hate speech for detection purposes. It does not generate or endorse such content. Use responsibly for moderation systems.
stance
Detect stance (support / oppose / neutral) toward a target topic or claim.
```python
import malaysian_manglish_nlp as mnlp

mnlp.stance("Memang patut la naikkan gaji minimum, dah lama tak naik",
            target="minimum wage increase")
# {'stance': 'support', 'confidence': 0.88}
```
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Input text |
| `target` | `str` | `None` | Target topic (auto-detected if omitted) |
| `targets` | `list[str]` | `None` | Check stance against multiple targets |
| `explain` | `bool` | `False` | Return linguistic cues driving the classification |
Multi-Target Stance
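The return shape for the `targets` parameter is not shown on this page, so the sketch below takes a different route: it calls a single-target classifier once per target and collects the labels. The classifier is passed in as a function so the example runs without the library; `stance_by_target` and `fake_stance` are hypothetical, and the canned answers are placeholders rather than model output.

```python
from typing import Callable, Dict, List

# Signature of a single-target stance call: (text, target) -> result dict.
StanceFn = Callable[[str, str], dict]


def stance_by_target(text: str, targets: List[str], classify: StanceFn) -> Dict[str, str]:
    """Run a single-target stance classifier once per target and keep the labels."""
    return {target: classify(text, target)["stance"] for target in targets}


def fake_stance(text: str, target: str) -> dict:
    """Stand-in classifier with canned answers so the sketch runs without the library."""
    canned = {"minimum wage increase": "support", "fuel subsidy cut": "neutral"}
    return {"stance": canned.get(target, "neutral"), "confidence": 0.5}


text = "Memang patut la naikkan gaji minimum, dah lama tak naik"
labels = stance_by_target(text, ["minimum wage increase", "fuel subsidy cut"], fake_stance)
print(labels)
```

With the library installed, the stand-in could be replaced by `lambda t, target: mnlp.stance(t, target=target)`, which uses only the `target` parameter documented above.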
Stance with Explanation
discourse
Analyse discourse structure and rhetorical relations in text using Rhetorical Structure Theory (RST) adapted for Malay.
```python
import malaysian_manglish_nlp as mnlp

text = "Walaupun hujan lebat, Ahmad tetap pergi kerja sebab deadline esok."
mnlp.discourse(text)
# {'relations': [
#     {'type': 'concession', 'arg1': 'hujan lebat', 'arg2': 'Ahmad tetap pergi kerja'},
#     {'type': 'cause', 'arg1': 'deadline esok', 'arg2': 'pergi kerja'}
# ]}
```
Supported Relations

| Relation | Malay Connectives | Example |
|---|---|---|
| `cause` | sebab, kerana | "Dia marah sebab lambat" |
| `contrast` | tapi, tetapi | "Mahal tapi berbaloi" |
| `concession` | walaupun, biar pun | "Walaupun penat, dia teruskan" |
| `elaboration` | iaitu, misalnya | "Buah tropika, misalnya durian" |
| `condition` | kalau, jika | "Kalau hujan, bawa payung" |
| `temporal` | lepas, sebelum, sambil | "Lepas makan, dia tidur" |
| `purpose` | supaya, untuk | "Belajar rajin supaya lulus" |
| `result` | maka, jadi | "Hujan lebat, jadi banjir" |
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Input text |
| `format` | `str` | `"list"` | Output format: `"list"` or `"tree"` |
| `connectives` | `bool` | `False` | Return detected connectives with positions |
Connective Detection
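To make the idea of connectives with positions concrete, the sketch below looks up the connectives from the Supported Relations table with a regular expression and reports character offsets. It is a plain dictionary lookup written for illustration, not the library's detector: `find_connectives` and its output keys are hypothetical, and words such as "jadi", "lepas", and "untuk" have non-connective uses that a lookup like this will over-match.

```python
import re
from typing import Dict, List

# Connectives copied from the Supported Relations table.
CONNECTIVES: Dict[str, List[str]] = {
    "cause": ["sebab", "kerana"],
    "contrast": ["tapi", "tetapi"],
    "concession": ["walaupun", "biar pun"],
    "elaboration": ["iaitu", "misalnya"],
    "condition": ["kalau", "jika"],
    "temporal": ["lepas", "sebelum", "sambil"],
    "purpose": ["supaya", "untuk"],
    "result": ["maka", "jadi"],
}

# Reverse lookup: connective -> relation label.
RELATION_OF = {word: rel for rel, words in CONNECTIVES.items() for word in words}

# Longest alternatives first, so multi-word connectives are tried before shorter ones.
_ALTERNATIVES = sorted(RELATION_OF, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in _ALTERNATIVES) + r")\b",
    re.IGNORECASE,
)


def find_connectives(text: str) -> List[dict]:
    """Return each table connective found in text with its character span."""
    found = []
    for match in _PATTERN.finditer(text):
        word = match.group(1).lower()
        found.append({"connective": word, "relation": RELATION_OF[word],
                      "start": match.start(), "end": match.end()})
    return found


text = "Walaupun hujan lebat, Ahmad tetap pergi kerja sebab deadline esok."
for hit in find_connectives(text):
    print(hit)
```

For the sample sentence this finds "walaupun" (concession) and "sebab" (cause), matching the two relation types in the documented `discourse` output.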
See Also
- Analysis - sentiment, emotion, and sarcasm detection
- Extraction - NER, POS, and dependency parsing
- Intent + Pipeline - chain intent detection with slot extraction for chatbots
- Evaluate - benchmark classification accuracy on your data