
Pipeline Usage

Chain multiple NLP modules into a single call: custom workflows, batch processing, and serializable pipelines.


Why pipelines?

Running five separate modules on the same text means five function calls, five result dicts to merge, and repeated boilerplate. A pipeline lets you define a sequence of modules once and run it in a single call; the results are merged automatically into one structured output.
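To make the idea concrete, here is a minimal sketch of "define the sequence once, run it in one call, merge the results". It is not the library's implementation: `run_steps`, `STEP_FUNCS`, and the two stand-in step functions are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in steps; real modules would do actual NLP work.
def fake_normalize(text: str) -> str:
    return text.replace("mkn", "makan")

def fake_length(text: str) -> Dict[str, int]:
    return {"chars": len(text), "words": len(text.split())}

# Registry mapping an output key to the function that produces it.
STEP_FUNCS: Dict[str, Callable[[str], Any]] = {
    "normalized": fake_normalize,
    "length": fake_length,
}

def run_steps(text: str, steps: List[str]) -> Dict[str, Any]:
    """Run each named step on the text and merge the outputs into one dict."""
    merged: Dict[str, Any] = {"original": text}
    for name in steps:
        merged[name] = STEP_FUNCS[name](text)
    return merged

result = run_steps("gila best mkn dia", ["normalized", "length"])
print(result)
```

The caller gets one dict back regardless of how many steps ran, which is the same shape of convenience the pipeline call provides.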


Load module

import malaysian_manglish_nlp as mnlp

# Default pipeline (normalize + sentiment + language + emotion)
result = mnlp.pipeline("gila best mkn dia")
print(result)
# {
#     'original': 'gila best mkn dia',
#     'normalized': 'gila best makan dia',
#     'sentiment': {'label': 'positive', 'score': 0.92},
#     'language': {'primary': 'manglish', ...},
#     'emotion': {'primary': 'joy', 'score': 0.85},
# }

Basic usage

Default pipeline

The default pipeline runs 4 modules: normalize, sentiment, language, emotion.

result = mnlp.pipeline("Weh best gila kedai tu!")

print(result['original'])
# Weh best gila kedai tu!

print(result['normalized'])
# Weh best gila kedai tu!

print(result['sentiment']['label'])
# positive

print(result['emotion']['primary'])
# joy

Custom steps

Choose which modules to run:

result = mnlp.pipeline(
    "Ahmad kerja kat Petronas KL",
    steps=['normalize', 'ner', 'pos']
)

print(result['ner'])
# [('Ahmad', 'PERSON'), ('Petronas', 'ORG'), ('KL', 'LOCATION')]

print(result['pos'])
# [('Ahmad', 'PROPN'), ('kerja', 'VERB'), ...]

Available steps

Step        Module                      What it does
normalize   mnlp.normalize()            Expand shortforms
sentiment   mnlp.sentiment()            Sentiment analysis
language    mnlp.detect_language()      Language detection
emotion     mnlp.detect_emotion()       Emotion detection
profanity   mnlp.detect_profanity()     Profanity filter
dialect     mnlp.detect_dialect()       Dialect detection
sarcasm     mnlp.detect_sarcasm()       Sarcasm detection
tokenize    mnlp.tokenize()             Tokenization
stem        mnlp.stem()                 Malay stemming
pos         mnlp.pos_tag()              POS tagging
ner         mnlp.ner_tag()              Named entities
segment     mnlp.segment()              Text segmentation
formalize   mnlp.formalize()            Formal BM
clean       mnlp.clean()                Noise removal
keywords    mnlp.extract_keywords()     Keyword extraction
correct     mnlp.correct()              Spelling correction
dependency  mnlp.parse_dependencies()   Dependency parsing
all         everything                  Run all available steps
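Since step names are plain strings, a typo is easy to make. The sketch below is a client-side pre-check against the step names in the table above, with a spelling suggestion from the standard library's `difflib`. `KNOWN_STEPS` and `find_unknown_steps` are hypothetical helpers, not part of the library, and how the library itself reacts to an unknown step name is not covered here.

```python
import difflib
from typing import Dict, List, Optional

# Step names copied from the table above.
KNOWN_STEPS = [
    "normalize", "sentiment", "language", "emotion", "profanity", "dialect",
    "sarcasm", "tokenize", "stem", "pos", "ner", "segment", "formalize",
    "clean", "keywords", "correct", "dependency", "all",
]

def find_unknown_steps(steps: List[str]) -> Dict[str, Optional[str]]:
    """Map each unrecognized step name to its closest known name (or None)."""
    unknown: Dict[str, Optional[str]] = {}
    for name in steps:
        if name not in KNOWN_STEPS:
            matches = difflib.get_close_matches(name, KNOWN_STEPS, n=1)
            unknown[name] = matches[0] if matches else None
    return unknown

print(find_unknown_steps(["normalize", "sentimen", "ner"]))
```

An empty dict means every requested step is in the documented list.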

Run all modules

result = mnlp.pipeline("Weh best gila kedai tu!", steps=['all'])

# All module results in one dict:
# result['original']
# result['normalized']
# result['sentiment']
# result['language']
# result['emotion']
# result['profanity']
# result['dialect']
# result['sarcasm']
# result['pos']
# result['ner']
# result['keywords']
# ... and more

Performance with 'all'

Running all modules takes ~5ms per text. For high-throughput scenarios, only include the modules you need.


Normalize first

By default, text is normalized before other modules run. This improves accuracy for sentiment, NER, and language detection on informal text.

# Normalize first (default)
result = mnlp.pipeline("xpe la best gila mkn dia", normalize_first=True)
# normalized text used for sentiment, language, emotion

# Skip normalization
result = mnlp.pipeline("xpe la best gila mkn dia", normalize_first=False)
# raw text used for all modules
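The effect of the flag can be pictured as a choice of which string the downstream modules receive. The following is a sketch of that data flow under stated assumptions, not the library's code: `run_pipeline_sketch`, `toy_normalize`, and `mentions_makan` are hypothetical, and the sketch assumes the normalized text is always computed and stored even when it is not fed to the other steps.

```python
from typing import Any, Callable, Dict

def run_pipeline_sketch(
    text: str,
    normalize: Callable[[str], str],
    analyzers: Dict[str, Callable[[str], Any]],
    normalize_first: bool = True,
) -> Dict[str, Any]:
    """Feed analyzers the normalized text if normalize_first, else the raw text."""
    normalized = normalize(text)
    analyzer_input = normalized if normalize_first else text
    result: Dict[str, Any] = {"original": text, "normalized": normalized}
    for name, analyze in analyzers.items():
        result[name] = analyze(analyzer_input)
    return result

# Hypothetical stand-ins so the sketch runs without the library.
def toy_normalize(text: str) -> str:
    return " ".join("makan" if word == "mkn" else word for word in text.split())

def mentions_makan(text: str) -> bool:
    return "makan" in text.split()

text = "best gila mkn dia"
analyzers = {"mentions_makan": mentions_makan}
with_norm = run_pipeline_sketch(text, toy_normalize, analyzers)
without_norm = run_pipeline_sketch(text, toy_normalize, analyzers, normalize_first=False)
print(with_norm["mentions_makan"], without_norm["mentions_makan"])  # True False
```

The toy analyzer only recognizes the expanded word, so it succeeds on the normalized input and misses on the raw shortform, which is the kind of accuracy difference the flag is meant to address.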

The analyze function

mnlp.analyze() is a shortcut for the full pipeline:

result = mnlp.analyze("Weh best gila kedai tu, tapi service slow sikit")

# Returns comprehensive analysis:
# - normalized text
# - sentiment (with aspect detection for mixed)
# - language detection
# - POS tags
# - named entities
# - emotion
# - keywords

Batch pipeline

Process multiple texts through the same pipeline:

texts = [
    "Best gila movie tu!",
    "Teruk la service dia",
    "Ahmad kerja kat Petronas KL",
    "Ambo nok make nasi kerabu",
]

results = mnlp.batch_pipeline(
    texts,
    steps=['normalize', 'sentiment', 'ner']
)

for text, result in zip(texts, results):
    sent = result['sentiment']['label']
    entities = result.get('ner', [])
    print(f"{text[:30]:32s} → {sent:8s} | {entities}")

# Best gila movie tu!              → positive | []
# Teruk la service dia             → negative | []
# Ahmad kerja kat Petronas KL      → neutral  | [('Ahmad', 'PERSON'), ...]
# Ambo nok make nasi kerabu        → neutral  | []
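Because a batch returns one result dict per input, aggregate statistics are a short step away. The sketch below counts sentiment labels and collects entities. `summarize_batch` is a hypothetical helper, and `sample_results` is hand-written placeholder data shaped like the example output above so that the snippet runs without the library.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

def summarize_batch(results: List[Dict[str, Any]]) -> Tuple[Counter, List[Any]]:
    """Count sentiment labels and flatten all NER entries across a batch."""
    labels = Counter(r["sentiment"]["label"] for r in results)
    entities = [entity for r in results for entity in r.get("ner", [])]
    return labels, entities

# Placeholder results, not real library output.
sample_results = [
    {"sentiment": {"label": "positive"}, "ner": []},
    {"sentiment": {"label": "negative"}, "ner": []},
    {"sentiment": {"label": "neutral"}, "ner": [("Ahmad", "PERSON")]},
    {"sentiment": {"label": "neutral"}, "ner": []},
]

labels, entities = summarize_batch(sample_results)
print(labels.most_common())
print(entities)
```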

Custom pipeline workflow

Build a custom analysis function:

def analyze_review(text):
    """Custom review analysis pipeline."""
    result = mnlp.pipeline(text, steps=['normalize', 'sentiment', 'emotion', 'keywords'])

    return {
        'text': result['normalized'],
        'sentiment': result['sentiment']['label'],
        'confidence': result['sentiment']['score'],
        'emotion': result['emotion']['primary'],
        'keywords': result['keywords'][:5],
        'is_positive': result['sentiment']['label'] == 'positive',
    }

# Use on a batch of reviews
reviews = [
    "Best gila makanan dia, sedap!",
    "Mahal sangat, tak berbaloi",
    "Service ok, food average je",
]

for review in reviews:
    analysis = analyze_review(review)
    print(f"{analysis['sentiment']:8s} | {analysis['emotion']:12s} | {analysis['text'][:30]}")
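A flat dict like the one `analyze_review` returns is easy to persist. As one possible approach, the sketch below writes records as JSON Lines with the standard `json` module. `to_jsonl` and `from_jsonl` are hypothetical helpers, and the records are invented placeholders: the values are not real model output, and the assumption that `keywords` is a list of strings is mine.

```python
import json
from typing import Any, Dict, Iterable, List

def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)

def from_jsonl(blob: str) -> List[Dict[str, Any]]:
    """Parse a JSON Lines string back into a list of dicts."""
    return [json.loads(line) for line in blob.split("\n") if line.strip()]

# Placeholder records shaped like analyze_review() output; values are invented.
records = [
    {"text": "Best gila makanan dia, sedap!", "sentiment": "positive",
     "confidence": 0.9, "emotion": "joy", "keywords": ["makanan", "sedap"],
     "is_positive": True},
    {"text": "Mahal sangat, tak berbaloi", "sentiment": "negative",
     "confidence": 0.8, "emotion": "anger", "keywords": ["mahal"],
     "is_positive": False},
]

blob = to_jsonl(records)
restored = from_jsonl(blob)
print(blob)
```

Note that JSON has no tuple type, so tuple-valued fields (such as the NER pairs shown earlier) would come back as lists after a round trip.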

Content moderation pipeline

def moderate(text):
    """Content moderation pipeline."""
    result = mnlp.pipeline(text, steps=['normalize', 'profanity', 'sarcasm', 'sentiment'])

    flags = []
    if result.get('profanity', {}).get('is_profanity'):
        flags.append('profanity')
    if result.get('sarcasm', {}).get('is_sarcastic'):
        flags.append('sarcasm')
    if result.get('sentiment', {}).get('label') == 'negative':
        flags.append('negative')

    return {
        'approved': len(flags) == 0,
        'flags': flags,
        'clean_text': result['normalized'],
    }
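A moderation function of this shape is typically applied to a stream of texts and used to route them. The sketch below splits inputs into approved and held-for-review lists. `split_by_approval` is a hypothetical helper, and `toy_moderate` is a keyword-based stand-in for `moderate()` so the example runs without the library.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Verdict = Dict[str, Any]

def split_by_approval(
    texts: Iterable[str],
    moderate_fn: Callable[[str], Verdict],
) -> Tuple[List[Tuple[str, List[str]]], List[Tuple[str, List[str]]]]:
    """Route each text to 'approved' or 'held' based on the moderation verdict."""
    approved: List[Tuple[str, List[str]]] = []
    held: List[Tuple[str, List[str]]] = []
    for text in texts:
        verdict = moderate_fn(text)
        target = approved if verdict["approved"] else held
        target.append((text, verdict["flags"]))
    return approved, held

# Stand-in with the same return shape as moderate(); not a real classifier.
def toy_moderate(text: str) -> Verdict:
    flags = ["negative"] if "teruk" in text.lower() else []
    return {"approved": not flags, "flags": flags, "clean_text": text}

approved, held = split_by_approval(["Best gila!", "Teruk la service dia"], toy_moderate)
print(approved)
print(held)
```

In real use, `moderate` from the example above would be passed in place of `toy_moderate`.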

CLI usage

# Full analysis (default pipeline)
$ mnlp analyze "Weh best gila kedai tu"

# With JSON output
$ mnlp analyze "Sedap nasi lemak" --json

# Pipe input
$ echo "Best gila!" | mnlp analyze

# Specific modules
$ mnlp sentiment "Best gila!" && mnlp ner "Ahmad kat KL"

# Benchmark
$ mnlp benchmark

Performance

Pipeline                     Latency   Throughput
Default (4 modules)          ~2ms      15,000 texts/sec
Minimal (2 modules)          ~1ms      30,000 texts/sec
All modules                  ~5ms      6,000 texts/sec
Batch (100 texts, default)   ~150ms    12,000 texts/sec

Optimization

  • Only include modules you need
  • Use batch_pipeline for multiple texts
  • Normalizing first (normalize_first=True, the default) improves accuracy but adds ~0.2ms
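To check how a particular step selection behaves on your own hardware, a small timing harness is enough. The sketch below measures median per-text latency for any callable using `time.perf_counter`. `measure_latency_ms` is a hypothetical helper, and `str.upper` stands in for a pipeline call so the snippet runs without the library.

```python
import statistics
import time
from typing import Callable, List

def measure_latency_ms(fn: Callable[[str], object], texts: List[str], repeats: int = 5) -> float:
    """Return the median per-text latency in milliseconds over several passes."""
    per_text_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            fn(text)
        elapsed = time.perf_counter() - start
        per_text_ms.append(elapsed * 1000 / len(texts))
    return statistics.median(per_text_ms)

# str.upper is a placeholder workload, not an NLP step.
latency = measure_latency_ms(str.upper, ["Weh best gila kedai tu!"] * 1000)
print(f"{latency:.5f} ms per text")
if latency > 0:
    print(f"~{1000 / latency:,.0f} texts/sec (single-threaded estimate)")
```

To time a real configuration, pass something like `lambda t: mnlp.pipeline(t, steps=['normalize', 'sentiment'])` as `fn`.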

See also