Generation
Translate, summarise, and generate Malaysian text, and answer questions about it.
Overview
Generation modules produce or transform text: translation between languages, summarisation of long documents, controlled text generation, and question answering. All require the [ml] extra for transformer-backed models.
ML Dependency Required
All generation modules require transformer models. The first call downloads the model (~200 MB); subsequent calls use the cached weights.
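Because the first call triggers a download, it can help to confirm up front that the optional dependencies are importable. The following is a minimal sketch, assuming the [ml] extra installs the `transformers` package (an assumption; the extra's exact contents are not listed on this page). `ml_extra_available` is a hypothetical helper, not part of the library.

```python
import importlib.util


def ml_extra_available(modules=("transformers",)):
    """Return True if every named top-level module can be found."""
    # ASSUMPTION: the [ml] extra provides `transformers`; adjust the tuple
    # if the extra installs different packages.
    return all(importlib.util.find_spec(name) is not None for name in modules)
```

Calling `ml_extra_available()` before the first generation call lets an application fail early with a clear message instead of partway through a request.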
Quick Start
import malaysian_manglish_nlp as mnlp
# Translate Manglish → English
mnlp.translate("Weh best gila movie tu bro", target="en")
# "Hey, that movie was really great, bro"
# Summarise
article = "Kerajaan umum pakej RM50B... (long text)"
mnlp.summarize(article, max_length=30)
# "Kerajaan umum pakej rangsangan RM50B merangkumi bantuan tunai, moratorium, dan subsidi upah."
# Question answering
mnlp.qa("Bila UMP ditubuhkan?", context="UMP ditubuhkan pada tahun 2002 di Gambang.")
# {'answer': '2002', 'confidence': 0.95}
Module Details
translation
Translate between Bahasa Melayu, English, and Manglish. Supports entity preservation and register-aware output.
import malaysian_manglish_nlp as mnlp
# BM → English
mnlp.translate("Aku nak pergi makan nasi lemak", target="en")
# "I want to go eat nasi lemak"
# English → BM
mnlp.translate("The weather is really nice today", target="ms")
# "Cuaca hari ini sangat cantik"
# Manglish → formal BM
mnlp.translate("Weh best gila movie tu bro", target="ms_formal")
# "Filem itu sangat bagus"
# Formal → Manglish (natural Malaysian style)
mnlp.translate("Filem itu sangat bagus", target="manglish")
# "Movie tu memang best gila"
Translation Directions
| From \ To | en | ms | ms_formal | manglish |
|---|---|---|---|---|
| BM | ✅ | - | ✅ | ✅ |
| English | - | ✅ | ✅ | ✅ |
| Manglish | ✅ | ✅ | ✅ | - |
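The table above can be transcribed into a small lookup for validating a direction before calling the model. This is a sketch only: the source labels `"bm"`, `"en"`, and `"manglish"` and the helper `is_supported` are hypothetical, since `translate` as documented here takes only a `target`.

```python
# Supported directions, transcribed from the Translation Directions table.
# ASSUMPTION: the source labels are names chosen for this sketch; the
# documented API has no explicit source-language parameter.
SUPPORTED_TARGETS = {
    "bm": {"en", "ms_formal", "manglish"},
    "en": {"ms", "ms_formal", "manglish"},
    "manglish": {"en", "ms", "ms_formal"},
}


def is_supported(source: str, target: str) -> bool:
    """Return True if the table marks source -> target as supported."""
    return target in SUPPORTED_TARGETS.get(source, set())
```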
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str \| list[str]` | required | Input text or batch |
| `target` | `str` | required | Target language code |
| `preserve_entities` | `bool` | `True` | Keep names and places untranslated |
| `informal` | `bool` | `False` | Use informal register in output |
| `alternatives` | `int` | `1` | Number of translation variants to return |
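One common way to keep names and places untranslated is to mask them with placeholder tokens before translation and restore them afterwards. The sketch below illustrates that general technique; it is not necessarily how `preserve_entities` is implemented. `mask_entities`, `unmask_entities`, and `translate_preserving` are hypothetical helpers, and `translate` is any callable with the `(text, target=...)` shape used on this page.

```python
def mask_entities(text, entities):
    """Replace each entity found in text with a placeholder token."""
    mapping = {}
    # Mask longer names first so a short name cannot split a longer one.
    for i, entity in enumerate(sorted(entities, key=len, reverse=True)):
        if entity in text:
            token = f"__ENT{i}__"
            text = text.replace(entity, token)
            mapping[token] = entity
    return text, mapping


def unmask_entities(text, mapping):
    """Put the original entities back in place of their placeholders."""
    for token, entity in mapping.items():
        text = text.replace(token, entity)
    return text


def translate_preserving(translate, text, target, entities):
    """Translate text while keeping the listed entities verbatim."""
    masked, mapping = mask_entities(text, entities)
    # ASSUMPTION: the translator leaves the placeholder tokens untouched.
    return unmask_entities(translate(masked, target=target), mapping)
```

This simple version matches entities by substring, so it would need word-boundary checks before being used on real text.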
Alternative Translations
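The `alternatives` parameter requests several variants of one translation. The return shape is not documented on this page, so the sketch below assumes a single string when one variant is requested and a list of strings otherwise, and normalises both to a list. `translation_variants` is a hypothetical helper; pass `mnlp.translate` as the `translate` argument.

```python
def translation_variants(translate, text, target, n=3):
    """Return up to n distinct translation variants as a list."""
    # ASSUMPTION: the call returns a str for one variant and a list of
    # str for several; this return shape is not confirmed by the docs.
    result = translate(text, target=target, alternatives=n)
    variants = [result] if isinstance(result, str) else list(result)
    # Drop exact duplicates while keeping the original order.
    return list(dict.fromkeys(variants))
```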
Manglish Preservation
target="manglish" produces text that sounds like natural Malaysian speech rather than a word-for-word translation. This is useful for chatbot responses aimed at Malaysian users.
summarization
Summarise Malaysian text while preserving key information. Supports extractive (select key sentences) and abstractive (generate new summary) methods.
import malaysian_manglish_nlp as mnlp
article = """
Kerajaan Malaysia hari ini mengumumkan pakej rangsangan ekonomi bernilai
RM50 bilion untuk membantu rakyat dan perniagaan kecil yang terjejas.
Perdana Menteri berkata pakej ini merangkumi bantuan tunai langsung,
moratorium pinjaman, dan subsidi upah untuk pekerja. Beliau juga
mengumumkan pengurangan cukai untuk PKS selama 6 bulan.
"""
mnlp.summarize(article)
# "Kerajaan umum pakej rangsangan RM50B - bantuan tunai, moratorium,
# subsidi upah, dan pengurangan cukai PKS 6 bulan."
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Input document |
| `max_length` | `int` | `None` | Target word count for summary |
| `ratio` | `float` | `0.3` | Summary length as fraction of original |
| `method` | `str` | `"abstractive"` | `"extractive"` or `"abstractive"` |
| `format` | `str` | `"text"` | Output format: `"text"` or `"bullets"` |
| `lang` | `str` | `None` | Force output language (cross-lingual summary) |
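Both `max_length` and `ratio` control summary length. The sketch below shows one plausible way they could combine, assuming an explicit `max_length` takes precedence and `ratio` applies to the whitespace-separated word count otherwise. That precedence rule is an assumption, not documented behaviour, and `target_summary_words` is a hypothetical helper.

```python
def target_summary_words(text, max_length=None, ratio=0.3):
    """Estimate the target summary length in words."""
    # ASSUMPTION: an explicit max_length overrides ratio.
    if max_length is not None:
        return max_length
    # Otherwise take a fraction of the source word count, at least one word.
    return max(1, round(len(text.split()) * ratio))
```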
Bullet Point Summary
Cross-Lingual Summary
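A cross-lingual summary is requested through the `lang` parameter. The sketch below collects summaries of one document in several output languages; it assumes `lang` accepts the same codes as the translation targets (`"ms"`, `"en"`), which this page does not confirm. `summaries_by_language` is a hypothetical helper; pass `mnlp.summarize` as the `summarize` argument.

```python
def summaries_by_language(summarize, text, langs=("ms", "en")):
    """Summarise the same document once per requested output language."""
    # ASSUMPTION: `lang` takes codes such as "ms" and "en".
    return {lang: summarize(text, lang=lang) for lang in langs}
```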
Extractive vs Abstractive
- Extractive: faster; selects sentences verbatim from the source, which is good for factual accuracy
- Abstractive: slower; generates new sentences, which are more concise and readable
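To make the extractive idea concrete, here is a toy word-frequency sentence ranker. It is an illustration of the general technique only, not the library's method: it scores each sentence by the average corpus frequency of its words and returns the top sentences verbatim, in source order. `extractive_summary` is a hypothetical name.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Pick the n highest-scoring sentences, unchanged, in source order."""
    # Split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore source order
    return " ".join(sentences[i] for i in keep)
```

Every sentence in the output appears word for word in the input, which is the property that makes extractive summaries easy to check against the source.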
text_generation
Generate Malaysian text with controllable style, format, and creativity level.
import malaysian_manglish_nlp as mnlp
mnlp.generate("Tulis review restoran nasi lemak", max_length=100)
# "Nasi lemak kat kedai ni memang power. Sambal dia pedas just nice,
# ikan bilis rangup, dan nasi tu wangi gila. Portion pun besar.
# Confirm balik lagi next time."
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `str` | required | Generation prompt or seed text |
| `max_length` | `int` | `100` | Maximum output tokens |
| `style` | `str` | `"manglish"` | Output style: `"formal"`, `"manglish"`, `"mixed"` |
| `temperature` | `float` | `0.7` | Creativity (0.1 = near-deterministic, 1.0 = creative) |
| `format` | `str` | `"text"` | Output format: `"text"`, `"tweet"`, `"review"`, `"caption"` |
| `mode` | `str` | `"generate"` | `"generate"` or `"continue"` (extend existing text) |
Temperature Comparison
# Low temperature - predictable, focused
mnlp.generate("Nasi lemak is", temperature=0.2, max_length=20)
# "Nasi lemak is a traditional Malaysian dish made with coconut rice."
# High temperature - creative, varied
mnlp.generate("Nasi lemak is", temperature=0.9, max_length=20)
# "Nasi lemak is basically Malaysia's hug on a plate, no?"
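In sampling-based generation, temperature usually divides the model's logits before the softmax, so low values concentrate probability on the top token and high values flatten the distribution. The sketch below shows that standard calculation on made-up logits; whether this library applies temperature in exactly this way is an assumption, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for three candidate tokens (not real model output).
low = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.2)
high = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.9)
# `low` puts almost all probability on the first token; `high` spreads it out.
```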
Continuation Mode
qa
Question answering over Malaysian text. Supports extractive QA (find answer span in context), open-domain QA (no context), and conversational sessions with pronoun resolution.
import malaysian_manglish_nlp as mnlp
context = """
Universiti Malaysia Pahang (UMP) ditubuhkan pada tahun 2002.
Kampus utama terletak di Gambang, Pahang. UMP mempunyai lebih
10,000 pelajar dan menawarkan program dalam bidang kejuruteraan,
sains komputer, dan teknologi.
"""
mnlp.qa("Bila UMP ditubuhkan?", context=context)
# {'answer': '2002', 'confidence': 0.95, 'span': (46, 50)}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `question` | `str` | required | Question in BM, EN, or Manglish |
| `context` | `str` | `None` | Source document (omit for open-domain) |
| `top_k` | `int` | `1` | Number of answer candidates |
| `session` | `bool` | `False` | Enable conversational mode with pronoun resolution |
Cross-Lingual QA
Multi-Answer Extraction
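With `top_k` above 1 the call returns several answer candidates. The sketch below filters and ranks such candidates by confidence. It assumes the result is a list of dicts with the same `'answer'` and `'confidence'` keys as the single-answer output shown earlier; that list shape is an assumption, and `confident_answers` is a hypothetical helper.

```python
def confident_answers(candidates, min_confidence=0.5):
    """Keep candidates at or above the threshold, best first."""
    # ASSUMPTION: each candidate is a dict with 'answer' and 'confidence'.
    kept = [c for c in candidates if c["confidence"] >= min_confidence]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)
```

A typical use would be `confident_answers(mnlp.qa(question, context=context, top_k=3))`, with the threshold tuned per application.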
Conversational Session
Language Handling
Questions can be in BM, English, or Manglish regardless of the context language. The model handles cross-lingual QA natively, so no translation step is needed.
See Also
- Text Processing - preprocess text before translation or QA
- Embeddings - use sentence embeddings for retrieval-augmented QA
- Pipeline - chain QA with document retrieval
- Cache - cache generation results for repeated prompts