Tools & Utilities¶

Infrastructure for production NLP - pipelines, caching, profiling, evaluation, and hybrid routing.

Overview¶

The tools modules handle the engineering side of an NLP system: chaining modules into reusable pipelines, caching expensive operations, profiling latency and memory, evaluating model accuracy, calibrating confidence scores, tuning hyperparameters, post-processing OCR output, and combining rule-based and ML approaches.

import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import Pipeline, cache, profiler, evaluate

Quick Start¶

from malaysian_manglish_nlp import Pipeline

# Build a reusable pipeline
pipe = Pipeline([
    'clean',
    'normalize',
    'sentiment'
])

# Process
result = pipe("Weh @ahmad best gila mknn tu!! 🔥🔥")
# {'sentiment': {'label': 'positive', 'score': 0.93}}

# Batch with parallelism (texts is a list of input strings)
results = pipe.batch(texts, n_jobs=4)

# Save for later
pipe.save("sentiment_pipeline.json")

Module Details¶

`ocr_normalize`¶

Post-process OCR output from Malaysian documents. Fixes common OCR artefacts: character substitutions (1→l, 0→o, rn→m), broken line breaks, and Malay-specific patterns.

import malaysian_manglish_nlp as mnlp

ocr_text = "Kerajaan Ma1aysia te1ah mengumumkan po1isi baru"
mnlp.ocr_normalize(ocr_text)
# "Kerajaan Malaysia telah mengumumkan polisi baru"

Parameters¶

Parameter	Type	Default	Description
`text`	`str`	required	Raw OCR output
`engine`	`str`	`"auto"`	OCR engine hint: `"tesseract"`, `"easyocr"`, `"auto"`
`fix_substitutions`	`bool`	`True`	Fix common character substitutions
`fix_linebreaks`	`bool`	`False`	Reconstruct broken line wraps
`threshold`	`float`	`0.5`	Only correct tokens below this OCR confidence

OCR Engine Targeting

Specify `engine="tesseract"` or `engine="easyocr"` to apply engine-specific fix patterns. Tesseract commonly confuses rn/m and cl/d; EasyOCR tends to merge adjacent words.
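To make the substitution fixes concrete, here is a minimal standalone sketch that repairs digit look-alikes only when they sit between letters. It does not call the library: `fix_digit_substitutions` and its mapping table are hypothetical names for illustration, and the rules inside `ocr_normalize` are presumably broader (for example rn→m and engine-specific patterns).

```python
import re

# Hypothetical look-alike table for illustration; not the library's actual rules.
DIGIT_TO_LETTER = {"1": "l", "0": "o"}

# Matches a 0 or 1 that has a letter directly on both sides, e.g. the "1" in "Ma1aysia".
_LOOKALIKE = re.compile(r"(?<=[A-Za-z])[01](?=[A-Za-z])")


def fix_digit_substitutions(text: str) -> str:
    """Replace 0/1 with o/l only when the digit is surrounded by letters."""
    return _LOOKALIKE.sub(lambda m: DIGIT_TO_LETTER[m.group(0)], text)


fixed = fix_digit_substitutions("Kerajaan Ma1aysia te1ah mengumumkan po1isi baru")
```

Requiring a letter on both sides leaves genuine numbers such as `RM100` or `2024` untouched, at the cost of missing runs of adjacent misread digits.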

`pipeline`¶

Chain multiple modules into a reusable, serialisable processing pipeline. Supports custom config per step, conditional steps, and batch parallelism.

from malaysian_manglish_nlp import Pipeline

pipe = Pipeline([
    'clean',
    'normalize',
    'tokenize',
    'sentiment'
])

result = pipe("Weh @ahmad best gila mknn tu!! 🔥🔥")
# {'tokens': ['weh', 'best', 'gila', 'makanan', 'tu'],
#  'sentiment': {'label': 'positive', 'score': 0.93}}

Parameters (Pipeline constructor)¶

Parameter	Type	Description
`steps`	`list`	Module names or `(name, config)` tuples

Parameters (Pipeline methods)¶

Method	Parameters	Description
`__call__(text)`	`return_all=False`	Process single text
`batch(texts)`	`n_jobs=1`	Parallel batch processing
`save(path)`	-	Serialise to JSON
`Pipeline.load(path)`	-	Load from JSON

Custom Config Per Step

pipe = Pipeline([
    ('clean', {'keep_emoji': False}),
    ('normalize', {'aggressive': True}),
    ('sentiment', {'detailed': True})
])

Conditional Steps

pipe = Pipeline([
    'clean',
    ('normalize', {'condition': lambda x: len(x) > 10}),
    'sentiment'
])

Intermediate Results

result = pipe(text, return_all=True)
# {'clean': '...', 'normalize': '...', 'sentiment': {...}}

Performance

Pipelines avoid redundant computation by passing results between steps. pipe.batch(texts, n_jobs=4) uses process-level parallelism for CPU-bound modules.
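The step-chaining idea, including conditional steps and intermediate results, can be sketched without the library. `MiniPipeline` below is a toy stand-in written for this page, not the actual `Pipeline` class; its steps are plain Python callables rather than module names.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# (step name, function, optional condition on the current value)
Step = Tuple[str, Callable[[Any], Any], Optional[Callable[[Any], bool]]]


class MiniPipeline:
    """Toy illustration of chained steps; not the library's Pipeline."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps

    def __call__(self, text: str, return_all: bool = False) -> Any:
        value: Any = text
        outputs: Dict[str, Any] = {}
        for name, fn, condition in self.steps:
            # A step is skipped when its condition rejects the current value.
            if condition is not None and not condition(value):
                continue
            value = fn(value)  # each step consumes the previous step's output
            outputs[name] = value
        return outputs if return_all else value


pipe = MiniPipeline([
    ("clean", lambda s: " ".join(s.split()), None),
    ("lower", str.lower, lambda s: len(s) > 10),  # only runs on longer inputs
    ("tokenize", str.split, None),
])
```

Because each step receives the previous step's output, nothing is recomputed along the chain; `return_all=True` simply exposes the values that were already produced.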

`calibration`¶

Calibrate model confidence scores to produce reliable probability estimates. Essential for production systems that use confidence thresholds for routing or escalation.

import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import calibration

calibrator = calibration("sentiment", method="platt")

# Before calibration (overconfident)
mnlp.sentiment("Best gila!")
# {'label': 'positive', 'score': 0.99}

# After calibration (realistic)
calibrator.calibrate(mnlp.sentiment("Best gila!"))
# {'label': 'positive', 'score': 0.82}

Parameters¶

Parameter	Type	Default	Description
`model`	`str`	required	Module name to calibrate
`method`	`str`	`"platt"`	Method: `"platt"`, `"isotonic"`, `"temperature"`

Methods¶

Method	Description
`calibrate(result)`	Apply calibration to a model output
`fit(texts, labels)`	Fit calibrator on labelled data
`ece()`	Expected Calibration Error (lower = better)
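Of the three methods listed, temperature scaling is the shortest to illustrate. The sketch below handles the binary case only: it divides the logit of a score by a temperature and picks that temperature from a coarse grid by minimising log loss. The function names and the grid are assumptions made for this example, not the library's implementation.

```python
import math
from typing import Sequence

_EPS = 1e-12


def apply_temperature(score: float, temperature: float) -> float:
    """Rescale a binary probability by dividing its logit by `temperature`."""
    p = min(max(score, _EPS), 1.0 - _EPS)
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def fit_temperature(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the grid temperature (0.5 to 5.0) with the lowest mean log loss."""
    candidates = [0.5 + 0.1 * k for k in range(46)]

    def log_loss(temperature: float) -> float:
        total = 0.0
        for score, label in zip(scores, labels):
            p = apply_temperature(score, temperature)
            p = min(max(p, _EPS), 1.0 - _EPS)  # keep log() finite
            total -= math.log(p) if label == 1 else math.log(1.0 - p)
        return total / len(scores)

    return min(candidates, key=log_loss)


# Overconfident toy data: every score is 0.99 but only 8 of 10 are correct.
toy_scores = [0.99] * 10
toy_labels = [1] * 8 + [0] * 2
best_t = fit_temperature(toy_scores, toy_labels)
```

A fitted temperature above 1 pulls scores towards 0.5, which is the direction needed when a model is overconfident.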

Evaluate Calibration Quality

calibrator.ece()
# 0.03  (low error: confidence closely tracks observed accuracy)
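For reference, Expected Calibration Error is usually computed by grouping predictions into confidence bins and averaging the gap between mean confidence and accuracy in each bin, weighted by bin size. The standalone sketch below follows that standard definition with equal-width bins; whether `ece()` uses the same binning is an assumption.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Size-weighted mean |confidence - accuracy| over equal-width bins.

    Confidences are assumed to lie in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # Clamp so a confidence of exactly 1.0 falls into the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions at 0.9 confidence, three of them correct: gap of 0.15.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```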

`evaluate`¶

Evaluate NLP model performance with Malaysian-specific metrics. Covers classification, NER, and cross-validation with error analysis.

import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import evaluate

results = evaluate.sentiment(
    texts=test_texts,
    labels=test_labels,
    model=mnlp.sentiment
)
# {'accuracy': 0.87, 'f1_macro': 0.85, 'f1_weighted': 0.87,
#  'per_class': {'positive': 0.89, 'negative': 0.84, 'neutral': 0.81}}

Methods¶

Method	Description
`evaluate.sentiment(texts, labels, model)`	Classification metrics
`evaluate.ner(texts, gold, model)`	NER precision/recall/F1 per entity type
`evaluate.cross_validate(texts, labels, model, folds)`	K-fold cross-validation
`evaluate.errors(texts, labels, model)`	Misclassified samples with confidence
`evaluate.report(texts, labels, model, output)`	Generate HTML classification report
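As a reminder of what the macro-averaged F1 figure measures, the following standalone sketch computes per-class F1 and its unweighted mean from gold and predicted labels. It is independent of the library; `per_class_f1` and `macro_f1` are illustrative names, and the library's own metric code may differ in edge cases.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 for every label that occurs in the gold or predicted labels."""
    scores: Dict[str, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never matches.
        scores[label] = (2 * tp / denom) if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
toy_scores = per_class_f1(gold, pred)
toy_macro = macro_f1(gold, pred)
```

Macro averaging gives every class equal weight, so a weak minority class (often `neutral` in sentiment data) lowers the score more than it would under weighted averaging.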

Error Analysis

errors = evaluate.errors(test_texts, test_labels, model=mnlp.sentiment)
# [{'text': 'Hmm ok la tu...', 'predicted': 'positive',
#   'actual': 'negative', 'confidence': 0.51}]

`hybrid_ml`¶

Combine rule-based and ML approaches. Routes clear cases to fast rules, ambiguous cases to ML models. Ideal for latency-sensitive production systems.

from malaysian_manglish_nlp import hybrid_ml

# my_rules and transformer are callables you supply (see Parameters below)
model = hybrid_ml.create(
    task="sentiment",
    rules=my_rules,
    ml_model=transformer,
    threshold=0.7
)

# Clear case → rule fires (fast)
model("Best gila!")
# Uses rules → 0.2ms

# Ambiguous → ML fallback (accurate)
model("Hmm ok la tu...")
# Falls through to ML → 15ms

Parameters¶

Parameter	Type	Description
`task`	`str`	NLP task name
`rules`	`callable`	Fast rule-based function
`ml_model`	`callable`	ML model for fallback
`threshold`	`float`	Use ML when rule confidence < threshold
`router`	`callable`	Custom routing function (overrides threshold)

Performance Stats

model.stats()
# {'rule_hits': 7823, 'ml_hits': 2177, 'avg_latency_ms': 3.2}
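The routing logic can be pictured as a confidence gate in front of the ML model. The sketch below is a simplified stand-in, not the `hybrid_ml` implementation: `HybridRouter`, `keyword_rules`, and `fake_ml_model` are hypothetical, and the rule and model are assumed to return a `(label, confidence)` pair.

```python
from typing import Callable, Dict, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


class HybridRouter:
    """Use the rule result when it is confident enough, else fall back to ML."""

    def __init__(
        self,
        rules: Callable[[str], Prediction],
        ml_model: Callable[[str], Prediction],
        threshold: float = 0.7,
    ) -> None:
        self.rules = rules
        self.ml_model = ml_model
        self.threshold = threshold
        self.rule_hits = 0
        self.ml_hits = 0

    def __call__(self, text: str) -> Prediction:
        label, confidence = self.rules(text)
        if confidence >= self.threshold:
            self.rule_hits += 1
            return label, confidence
        # Rule confidence is below the threshold: defer to the slower model.
        self.ml_hits += 1
        return self.ml_model(text)

    def stats(self) -> Dict[str, int]:
        return {"rule_hits": self.rule_hits, "ml_hits": self.ml_hits}


def keyword_rules(text: str) -> Prediction:
    """Hypothetical keyword rule: confident only on obvious cue words."""
    lowered = text.lower()
    if "best" in lowered:
        return ("positive", 0.95)
    if "teruk" in lowered:
        return ("negative", 0.95)
    return ("neutral", 0.3)


def fake_ml_model(text: str) -> Prediction:
    """Placeholder for a real classifier; always returns the same prediction."""
    return ("negative", 0.51)


router = HybridRouter(keyword_rules, fake_ml_model, threshold=0.7)
```

Raising the threshold sends more traffic to the ML model, trading latency for accuracy; the hit counters show where that balance currently sits.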

`tuning`¶

Hyperparameter tuning for Malaysian NLP tasks. Supports grid search, random search, and Bayesian optimisation.

from malaysian_manglish_nlp import tuning

best_config = tuning.optimize(
    task="sentiment",
    train_data=train_texts,
    train_labels=train_labels,
    eval_data=eval_texts,
    eval_labels=eval_labels,
    n_trials=50
)
# {'model': 'transformer', 'lr': 2e-5, 'batch_size': 32, 'epochs': 3}

Methods¶

Method	Speed	Best For
`grid_search(task, param_grid, data)`	Slow	Small search spaces, exhaustive coverage
`random_search(task, distributions, n_trials)`	Medium	Large search spaces, quick exploration
`bayesian(task, search_space, n_trials)`	Fast	Production tuning, expensive evaluations
`optimize(...)`	Adaptive	Auto-selects best strategy

Early Stopping

tuning.optimize(task, data=data, early_stopping=True, patience=5)
# Stops when the eval metric has not improved for 5 consecutive trials
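Patience-based early stopping is easiest to see in a plain random search. The sketch below is generic Python, not the `tuning` module: `random_search`, the search space, and `toy_objective` are all made up for illustration, and a real objective would train and evaluate a model.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple

Config = Dict[str, Any]


def random_search(
    objective: Callable[[Config], float],
    space: Dict[str, List[Any]],
    n_trials: int = 50,
    patience: int = 5,
    seed: int = 0,
) -> Tuple[Optional[Config], float]:
    """Sample configs at random; stop after `patience` trials without improvement."""
    rng = random.Random(seed)
    best_config: Optional[Config] = None
    best_score = float("-inf")
    stale = 0  # consecutive trials that failed to beat the best score
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score, stale = config, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_config, best_score


def toy_objective(config: Config) -> float:
    """Hypothetical score that peaks at lr=2e-5 and batch_size=32."""
    return -abs(config["lr"] - 2e-5) * 1e5 - abs(config["batch_size"] - 32) / 32


space = {"lr": [1e-5, 2e-5, 5e-5], "batch_size": [16, 32]}
best_config, best_score = random_search(toy_objective, space, n_trials=50, patience=5)
```

A small `patience` saves evaluations but can stop before a good region is sampled, so it is worth setting relative to the size of the search space.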

`profiler`¶

Profile NLP pipeline performance - identify latency bottlenecks, memory usage, and throughput limits.

import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import profiler

with profiler.trace() as p:
    for text in texts[:100]:
        mnlp.sentiment(text)

p.report()
# ┌─────────────┬──────────┬─────────┬──────────┐
# │ Step        │ Avg (ms) │ Total   │ % Time   │
# ├─────────────┼──────────┼─────────┼──────────┤
# │ tokenize    │ 0.3      │ 30ms    │ 12%      │
# │ encode      │ 1.8      │ 180ms   │ 72%      │
# │ classify    │ 0.4      │ 40ms    │ 16%      │
# └─────────────┴──────────┴─────────┴──────────┘

Methods¶

Method	Description
`profiler.trace()`	Context manager for latency profiling
`profiler.memory()`	Context manager for peak memory tracking
`profiler.benchmark(fn, data, batch_sizes)`	Throughput at various batch sizes
`profiler.compare(models, data)`	Side-by-side model comparison
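A per-step latency report like the one above can be approximated with a small context manager around `time.perf_counter`. `StepTimer` is a hypothetical helper written for this example; how `profiler.trace()` instruments the library's internal steps is not shown here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator


class StepTimer:
    """Accumulates wall-clock time per named step."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)
        self.counts: Dict[str, int] = defaultdict(int)

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self) -> Dict[str, Dict[str, float]]:
        grand_total = sum(self.totals.values())
        rows: Dict[str, Dict[str, float]] = {}
        for name, total in self.totals.items():
            rows[name] = {
                "calls": self.counts[name],
                "avg_ms": 1000.0 * total / self.counts[name],
                "total_ms": 1000.0 * total,
                "pct": 100.0 * total / grand_total if grand_total else 0.0,
            }
        return rows


timer = StepTimer()
for _ in range(3):
    with timer.step("tokenize"):
        sum(range(1_000))      # stand-in for a cheap step
    with timer.step("encode"):
        sum(range(100_000))    # stand-in for an expensive step
rows = timer.report()
```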

Memory Profiling

with profiler.memory() as p:
    mnlp.embeddings(large_corpus)
p.peak_mb
# 245.3
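Peak memory for a block of Python code can be measured with the standard library's `tracemalloc`. The sketch below wraps it in a context manager that mirrors the `peak_mb` attribute; `track_peak_memory` and `MemoryResult` are hypothetical names, and this is not necessarily how `profiler.memory()` works.

```python
import tracemalloc
from contextlib import contextmanager
from typing import Iterator


class MemoryResult:
    """Holds the peak traced memory once the block has finished."""

    peak_mb: float = 0.0


@contextmanager
def track_peak_memory() -> Iterator[MemoryResult]:
    result = MemoryResult()
    tracemalloc.start()
    try:
        yield result
    finally:
        _current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        result.peak_mb = peak / (1024 * 1024)


with track_peak_memory() as mem:
    data = [0] * 1_000_000  # roughly 8 MB of list slots on a 64-bit build
```

`tracemalloc` only sees allocations made through Python's memory allocators, so buffers that a native extension allocates directly will not appear in the figure.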

Throughput Benchmark

profiler.benchmark(mnlp.sentiment, texts, batch_sizes=[1, 8, 32, 64])
# {1: '2,300 texts/sec', 8: '12,400 texts/sec',
#  32: '23,100 texts/sec', 64: '24,800 texts/sec'}

`cache`¶

Cache expensive NLP operations. Supports memory, disk, and Redis backends with TTL, warm-up, and per-module clearing.

import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import cache

# Built-in cache flag
mnlp.sentiment("Best gila!", cache=True)    # first call: computes and stores the result
mnlp.sentiment("Best gila!", cache=True)    # second call: served from the cache

# Decorator for custom functions
@cache.memoize(ttl=3600)
def get_embedding(text):
    return mnlp.embeddings(text)

Parameters (cache.configure)¶

Parameter	Type	Default	Description
`backend`	`str`	`"memory"`	`"memory"`, `"disk"`, or `"redis"`
`max_size`	`str`	`"512MB"`	Maximum cache size
`ttl`	`int`	`3600`	Time-to-live in seconds

Methods¶

Method	Description
`cache.stats()`	Hit rate, size, hit/miss counts
`cache.clear()`	Clear entire cache
`cache.clear(module=)`	Clear specific module cache
`cache.warm(texts, modules=)`	Pre-populate cache

Cache Stats

cache.stats()
# {'hits': 4521, 'misses': 892, 'hit_rate': 0.84, 'size_mb': 123}
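To show how a TTL and hit/miss counters interact, here is a minimal in-memory memoiser written from scratch. It is a sketch, not `cache.memoize`: the `clock` parameter, the `stats` and `clear` attributes, and the restriction to positional hashable arguments are all assumptions made to keep the example short.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, Tuple


def memoize(ttl: float = 3600.0, clock: Callable[[], float] = time.monotonic):
    """Cache results per argument tuple and expire them after `ttl` seconds."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        store: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}  # args -> (expires_at, value)
        counters = {"hits": 0, "misses": 0}

        @wraps(fn)
        def wrapper(*args: Any) -> Any:
            now = clock()
            entry = store.get(args)
            if entry is not None and entry[0] > now:
                counters["hits"] += 1
                return entry[1]
            # Missing or expired: recompute and store with a fresh expiry time.
            counters["misses"] += 1
            value = fn(*args)
            store[args] = (now + ttl, value)
            return value

        wrapper.stats = lambda: dict(counters)
        wrapper.clear = store.clear
        return wrapper

    return decorator


fake_now = [0.0]   # controllable clock so expiry can be shown without sleeping
computed = []      # records every real computation


@memoize(ttl=10, clock=lambda: fake_now[0])
def shout(text: str) -> str:
    computed.append(text)
    return text.upper()
```

Calling `shout("best")` twice computes once and hits once; advancing `fake_now[0]` past 10 forces a recomputation on the next call.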

When to Cache

Always: embeddings, generation, translation (expensive, repeated)
Sometimes: sentiment, NER (moderate cost, high repeat rate)
Skip: tokenize, clean (sub-ms, caching overhead exceeds savings)
