Tools & Utilities
Infrastructure for production NLP - pipelines, caching, profiling, evaluation, and hybrid routing.
Overview
The tools modules handle the engineering side: chaining modules into reusable pipelines, caching expensive operations, profiling latency and memory, evaluating model accuracy, calibrating confidence scores, tuning hyperparameters, and combining rule-based and ML approaches.
import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import Pipeline, cache, profiler, evaluate
Quick Start
from malaysian_manglish_nlp import Pipeline
# Build a reusable pipeline
pipe = Pipeline([
    'clean',
    'normalize',
    'sentiment',
])
# Process
result = pipe("Weh @ahmad best gila mknn tu!! 🔥🔥")
# {'sentiment': {'label': 'positive', 'score': 0.93}}
# Batch with parallelism
results = pipe.batch(texts, n_jobs=4)
# Save for later
pipe.save("sentiment_pipeline.json")
Module Details
ocr_normalize
Post-process OCR output from Malaysian documents. Fixes common OCR artefacts: character substitutions (1→l, 0→o, rn→m), broken line wraps (when fix_linebreaks is enabled), and Malay-specific patterns.
import malaysian_manglish_nlp as mnlp
ocr_text = "Kerajaan Ma1aysia te1ah mengumumkan po1isi baru"
mnlp.ocr_normalize(ocr_text)
# "Kerajaan Malaysia telah mengumumkan polisi baru"
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Raw OCR output |
| `engine` | `str` | `"auto"` | OCR engine hint: `"tesseract"`, `"easyocr"`, `"auto"` |
| `fix_substitutions` | `bool` | `True` | Fix common character substitutions |
| `fix_linebreaks` | `bool` | `False` | Reconstruct broken line wraps |
| `threshold` | `float` | `0.5` | Only correct tokens below this OCR confidence |
OCR Engine Targeting
Specify engine="tesseract" or "easyocr" for engine-specific fix patterns. Tesseract commonly confuses rn/m and cl/d; EasyOCR tends to merge adjacent words.
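To make the substitution fix concrete, here is a minimal standalone sketch of the idea, not the library's implementation. It assumes a hypothetical two-entry substitution table (`DIGIT_TO_LETTER`) and only rewrites look-alike digits inside tokens that also contain letters, so pure numbers are left alone. Identifiers that legitimately mix letters and digits would be altered by this naive rule, which is one reason a confidence threshold is useful in practice.

```python
import re

# Hypothetical substitution table; the library's real patterns are not shown in this document.
DIGIT_TO_LETTER = {"1": "l", "0": "o"}


def fix_substitutions(text: str) -> str:
    """Replace look-alike digits only inside tokens that also contain letters."""

    def fix_token(match: re.Match) -> str:
        token = match.group(0)
        if not any(ch.isalpha() for ch in token):
            return token  # pure numbers such as "2010" stay untouched
        return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)

    return re.sub(r"\w+", fix_token, text)


fixed = fix_substitutions("Kerajaan Ma1aysia te1ah mengumumkan po1isi baru pada 2010")
print(fixed)
# Kerajaan Malaysia telah mengumumkan polisi baru pada 2010
```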
pipeline
Chain multiple modules into a reusable, serialisable processing pipeline. Supports custom config per step, conditional steps, and batch parallelism.
from malaysian_manglish_nlp import Pipeline
pipe = Pipeline([
    'clean',
    'normalize',
    'tokenize',
    'sentiment',
])
result = pipe("Weh @ahmad best gila mknn tu!! 🔥🔥")
# {'tokens': ['weh', 'best', 'gila', 'makanan', 'tu'],
# 'sentiment': {'label': 'positive', 'score': 0.93}}
Parameters (Pipeline constructor)
| Parameter | Type | Description |
|---|---|---|
| `steps` | `list` | Module names or `(name, config)` tuples |
Parameters (Pipeline methods)
| Method | Parameters | Description |
|---|---|---|
| `__call__(text)` | `return_all=False` | Process a single text |
| `batch(texts)` | `n_jobs=1` | Parallel batch processing |
| `save(path)` | - | Serialise to JSON |
| `Pipeline.load(path)` | - | Load from JSON |
Custom Config Per Step
Conditional Steps
Intermediate Results
Performance
Pipelines avoid redundant computation by passing results between steps. pipe.batch(texts, n_jobs=4) uses process-level parallelism for CPU-bound modules.
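The chaining behaviour described in this section, including conditional steps and access to intermediate results, can be sketched without the library. The `Step` and `MiniPipeline` names and the `when` predicate below are hypothetical and exist only for illustration. Each step reads a shared state dict and returns the keys it adds, which is one simple way to pass results between steps without recomputing them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    fn: Callable[[dict], dict]
    # Hypothetical condition: the step runs only if this returns True.
    when: Optional[Callable[[dict], bool]] = None


class MiniPipeline:
    def __init__(self, steps: List[Step]):
        self.steps = steps

    def __call__(self, text: str) -> dict:
        state = {"text": text}
        for step in self.steps:
            if step.when is not None and not step.when(state):
                continue  # conditional step skipped
            state.update(step.fn(state))  # later steps see earlier results
        return state


def clean(state: dict) -> dict:
    # Drop @mentions from the raw text.
    words = [w for w in state["text"].split() if not w.startswith("@")]
    return {"text": " ".join(words)}


def tokenize(state: dict) -> dict:
    return {"tokens": state["text"].lower().split()}


def count_tokens(state: dict) -> dict:
    return {"n_tokens": len(state["tokens"])}


mini = MiniPipeline([
    Step("clean", clean),
    Step("tokenize", tokenize),
    Step("count", count_tokens, when=lambda s: len(s["tokens"]) > 1),
])

out = mini("Weh @ahmad best gila")
short = mini("Weh")
print(out)
# {'text': 'Weh best gila', 'tokens': ['weh', 'best', 'gila'], 'n_tokens': 3}
```

Returning the whole state dict is the sketch's equivalent of asking for all intermediate results; a real pipeline would likely filter it down to the final step's output by default.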
calibration
Calibrate model confidence scores to produce reliable probability estimates. Essential for production systems that use confidence thresholds for routing or escalation.
import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import calibration
calibrator = calibration("sentiment", method="platt")
# Before calibration (overconfident)
mnlp.sentiment("Best gila!")
# {'label': 'positive', 'score': 0.99}
# After calibration (realistic)
calibrator.calibrate(mnlp.sentiment("Best gila!"))
# {'label': 'positive', 'score': 0.82}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | required | Module name to calibrate |
| `method` | `str` | `"platt"` | Method: `"platt"`, `"isotonic"`, `"temperature"` |
Methods
| Method | Description |
|---|---|
| `calibrate(result)` | Apply calibration to a model output |
| `fit(texts, labels)` | Fit calibrator on labelled data |
| `ece()` | Expected Calibration Error (lower = better) |
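Expected Calibration Error is commonly computed by grouping predictions into confidence bins and averaging the gap between each bin's accuracy and its mean confidence, weighted by bin size. The function below is a standalone sketch of that standard definition using equal-width bins. It is an illustration, not the library's `ece()` implementation, and the function name and bin count are assumptions.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    total = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# An overconfident model: 95% confidence, but only 3 of 4 predictions are right.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
# 0.2
```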
evaluate
Evaluate NLP model performance with Malaysian-specific metrics. Covers classification, NER, and cross-validation with error analysis.
import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import evaluate
results = evaluate.sentiment(
    texts=test_texts,
    labels=test_labels,
    model=mnlp.sentiment,
)
# {'accuracy': 0.87, 'f1_macro': 0.85, 'f1_weighted': 0.87,
# 'per_class': {'positive': 0.89, 'negative': 0.84, 'neutral': 0.81}}
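The difference between `f1_macro` and `f1_weighted` in the output above follows from how per-class F1 scores are averaged: macro gives every class equal weight, while weighted scales each class by its number of gold examples. The following standalone sketch computes both from plain label lists. The function name `f1_scores` is hypothetical, and it assumes the per-class figures are F1 scores.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_scores(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, object]:
    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)  # number of gold examples per class
    per_class = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denominator if denominator else 0.0

    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[label] * support[label] for label in labels) / len(gold)
    return {"f1_macro": macro, "f1_weighted": weighted, "per_class": per_class}


gold = ["positive", "positive", "positive", "negative"]
pred = ["positive", "positive", "negative", "negative"]
scores = f1_scores(gold, pred)
print(round(scores["f1_macro"], 3), round(scores["f1_weighted"], 3))
# 0.733 0.767
```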
Methods
| Method | Description |
|---|---|
| `evaluate.sentiment(texts, labels, model)` | Classification metrics |
| `evaluate.ner(texts, gold, model)` | NER precision/recall/F1 per entity type |
| `evaluate.cross_validate(texts, labels, model, folds)` | K-fold cross-validation |
| `evaluate.errors(texts, labels, model)` | Misclassified samples with confidence |
| `evaluate.report(texts, labels, model, output)` | Generate HTML classification report |
Error Analysis
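As an illustration of what error analysis involves, the sketch below collects misclassified samples together with the model's confidence and lists the most confident mistakes first, since those are usually the most informative to inspect. It is a standalone approximation, not the behaviour of `evaluate.errors`. The `misclassified` helper and `toy_model` are hypothetical, and the model is assumed to return a dict with `label` and `score` keys as in the examples in this document.

```python
from typing import Callable, Dict, List, Sequence


def misclassified(
    texts: Sequence[str],
    labels: Sequence[str],
    model: Callable[[str], Dict[str, object]],
) -> List[dict]:
    errors = []
    for text, gold in zip(texts, labels):
        result = model(text)
        if result["label"] != gold:
            errors.append({
                "text": text,
                "gold": gold,
                "predicted": result["label"],
                "score": result["score"],
            })
    # Most confident mistakes first.
    return sorted(errors, key=lambda err: err["score"], reverse=True)


def toy_model(text: str) -> Dict[str, object]:
    # Hypothetical stand-in for a sentiment model: keys on the word "best".
    if "best" in text.lower():
        return {"label": "positive", "score": 0.9}
    return {"label": "negative", "score": 0.6}


texts = ["Best gila!", "Tak best langsung", "Teruk"]
labels = ["positive", "negative", "negative"]
errors = misclassified(texts, labels, toy_model)
print(errors)
# [{'text': 'Tak best langsung', 'gold': 'negative', 'predicted': 'positive', 'score': 0.9}]
```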
hybrid_ml
Combine rule-based and ML approaches. Routes clear cases to fast rules, ambiguous cases to ML models. Ideal for latency-sensitive production systems.
from malaysian_manglish_nlp import hybrid_ml
model = hybrid_ml.create(
    task="sentiment",
    rules=my_rules,
    ml_model=transformer,
    threshold=0.7,
)
# Clear case → rule fires (fast)
model("Best gila!")
# Uses rules → 0.2ms
# Ambiguous → ML fallback (accurate)
model("Hmm ok la tu...")
# Falls through to ML → 15ms
Parameters
| Parameter | Type | Description |
|---|---|---|
| `task` | `str` | NLP task name |
| `rules` | `callable` | Fast rule-based function |
| `ml_model` | `callable` | ML model for fallback |
| `threshold` | `float` | Use ML when rule confidence < threshold |
| `router` | `callable` | Custom routing function (overrides threshold) |
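The threshold routing described by these parameters can be sketched in a few lines: call the rules first, keep their answer if the confidence reaches the threshold, and otherwise fall back to the ML model. This is a standalone illustration rather than the code behind `hybrid_ml.create`. The `make_hybrid` helper, the keyword table, and the `route` key are hypothetical, and both callables are assumed to return a dict with `label` and `score`.

```python
from typing import Callable, Dict

Predictor = Callable[[str], Dict[str, object]]


def make_hybrid(rules: Predictor, ml_model: Predictor, threshold: float = 0.7) -> Predictor:
    def predict(text: str) -> Dict[str, object]:
        rule_result = rules(text)
        if rule_result["score"] >= threshold:
            return {**rule_result, "route": "rules"}  # clear case: skip the model
        return {**ml_model(text), "route": "ml"}      # ambiguous: fall back to ML

    return predict


# Hypothetical keyword rules, for illustration only.
STRONG_WORDS = {"best": "positive", "teruk": "negative"}


def keyword_rules(text: str) -> Dict[str, object]:
    for word in text.lower().replace("!", " ").split():
        if word in STRONG_WORDS:
            return {"label": STRONG_WORDS[word], "score": 0.95}
    return {"label": "neutral", "score": 0.3}


def slow_model(text: str) -> Dict[str, object]:
    # Stand-in for an expensive ML model.
    return {"label": "neutral", "score": 0.8}


hybrid = make_hybrid(keyword_rules, slow_model, threshold=0.7)
clear = hybrid("Best gila!")
ambiguous = hybrid("Hmm ok la tu...")
print(clear["route"], ambiguous["route"])
# rules ml
```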
tuning
Hyperparameter tuning for Malaysian NLP tasks. Supports grid search, random search, and Bayesian optimisation.
from malaysian_manglish_nlp import tuning
best_config = tuning.optimize(
    task="sentiment",
    train_data=train_texts,
    train_labels=train_labels,
    eval_data=eval_texts,
    eval_labels=eval_labels,
    n_trials=50,
)
# {'model': 'transformer', 'lr': 2e-5, 'batch_size': 32, 'epochs': 3}
Methods
| Method | Speed | Best For |
|---|---|---|
| `grid_search(task, param_grid, data)` | Slow | Small search spaces, exhaustive coverage |
| `random_search(task, distributions, n_trials)` | Medium | Large search spaces, quick exploration |
| `bayesian(task, search_space, n_trials)` | Fast | Production tuning, expensive evaluations |
| `optimize(...)` | Adaptive | Auto-selects best strategy |
Early Stopping
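One common form of early stopping in a search loop is patience-based: stop once a fixed number of consecutive trials fail to improve on the best score. The sketch below combines that with a plain random search over a discrete space. It is a generic illustration, not the library's tuning code. The function signature, the `patience` parameter, and the toy objective are all assumptions.

```python
import random
from typing import Callable, Dict, List, Tuple


def random_search(
    objective: Callable[[dict], float],
    space: Dict[str, List[object]],
    n_trials: int = 50,
    patience: int = 10,
    seed: int = 0,
) -> Tuple[dict, float, int]:
    """Sample configs at random; stop early after `patience` trials without improvement."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    since_best = 0
    trials_run = 0

    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        trials_run += 1
        if score > best_score:
            best_config, best_score, since_best = config, score, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # early stop: no recent improvement
    return best_config, best_score, trials_run


def toy_objective(config: dict) -> float:
    # Hypothetical score that peaks at lr=2e-5 and batch_size=32.
    return -abs(config["lr"] - 2e-5) * 1e4 - abs(config["batch_size"] - 32) / 100


space = {"lr": [1e-5, 2e-5, 5e-5], "batch_size": [16, 32, 64]}
best_config, best_score, trials_run = random_search(toy_objective, space)
print(best_config, trials_run)
```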
profiler
Profile NLP pipeline performance - identify latency bottlenecks, memory usage, and throughput limits.
import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import profiler
with profiler.trace() as p:
    for text in texts[:100]:
        mnlp.sentiment(text)
p.report()
# ┌─────────────┬──────────┬─────────┬──────────┐
# │ Step │ Avg (ms) │ Total │ % Time │
# ├─────────────┼──────────┼─────────┼──────────┤
# │ tokenize │ 0.3 │ 30ms │ 12% │
# │ encode │ 1.8 │ 180ms │ 72% │
# │ classify │ 0.4 │ 40ms │ 16% │
# └─────────────┴──────────┴─────────┴──────────┘
Methods
| Method | Description |
|---|---|
| `profiler.trace()` | Context manager for latency profiling |
| `profiler.memory()` | Context manager for peak memory tracking |
| `profiler.benchmark(fn, data, batch_sizes)` | Throughput at various batch sizes |
| `profiler.compare(models, data)` | Side-by-side model comparison |
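For readers curious how a peak-memory context manager can work, the sketch below uses the standard library's `tracemalloc` module. It is a standalone illustration and not the implementation of `profiler.memory()`. The `peak_memory` name and the `peak_bytes` key are hypothetical. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extensions may not be counted.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def peak_memory():
    """Record the peak traced allocation (in bytes) inside the with-block."""
    stats = {}
    tracemalloc.start()
    try:
        yield stats
    finally:
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
        tracemalloc.stop()
        stats["peak_bytes"] = peak


with peak_memory() as mem:
    buffer = [str(i) for i in range(10_000)]  # allocate something measurable

print(f"peak: {mem['peak_bytes'] / 1024:.1f} KiB")
```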
Throughput Benchmark
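A throughput benchmark of this kind typically times the same function over the same data at several batch sizes and reports items processed per second. The sketch below shows the idea with `time.perf_counter`. It is a generic illustration, not the library's `profiler.benchmark`; the function body and the toy workload are assumptions.

```python
import time
from typing import Callable, Dict, List, Sequence


def benchmark(
    fn: Callable[[Sequence[str]], object],
    data: Sequence[str],
    batch_sizes: List[int],
) -> Dict[int, float]:
    """Return items per second for each batch size."""
    results = {}
    for size in batch_sizes:
        start = time.perf_counter()
        for offset in range(0, len(data), size):
            fn(data[offset:offset + size])
        elapsed = time.perf_counter() - start
        results[size] = len(data) / max(elapsed, 1e-9)  # guard against a zero reading
    return results


def toy_batch_fn(batch: Sequence[str]) -> List[str]:
    # Hypothetical workload standing in for a model call.
    return [text.upper() for text in batch]


data = ["best gila"] * 1000
throughput = benchmark(toy_batch_fn, data, batch_sizes=[1, 8, 64])
for size, rate in throughput.items():
    print(f"batch_size={size}: {rate:,.0f} texts/s")
```

Single runs like this are noisy; repeating each measurement and taking the median gives more stable numbers.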
cache
Cache expensive NLP operations. Supports memory, disk, and Redis backends with TTL, warm-up, and per-module clearing.
import malaysian_manglish_nlp as mnlp
from malaysian_manglish_nlp import cache
# Built-in cache flag
mnlp.sentiment("Best gila!", cache=True) # computes
mnlp.sentiment("Best gila!", cache=True) # cached - instant
# Decorator for custom functions
@cache.memoize(ttl=3600)
def get_embedding(text):
    return mnlp.embeddings(text)
Parameters (cache.configure)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `backend` | `str` | `"memory"` | `"memory"`, `"disk"`, or `"redis"` |
| `max_size` | `str` | `"512MB"` | Maximum cache size |
| `ttl` | `int` | `3600` | Time-to-live in seconds |
Methods
| Method | Description |
|---|---|
| `cache.stats()` | Hit rate, size, hit/miss counts |
| `cache.clear()` | Clear entire cache |
| `cache.clear(module=)` | Clear specific module cache |
| `cache.warm(texts, modules=)` | Pre-populate cache |
When to Cache
- Always: embeddings, generation, translation (expensive, repeated)
- Sometimes: sentiment, NER (moderate cost, high repeat rate)
- Skip: tokenize, clean (sub-ms, caching overhead exceeds savings)
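The TTL and hit/miss behaviour described in this section can be illustrated with a small in-memory memoiser. This is a sketch of the general technique, not the library's `cache.memoize`. The `clock` parameter and the `hits`/`misses` counters are assumptions added so that expiry can be demonstrated without waiting, and there is no size limit or eviction.

```python
import time
from functools import wraps


def memoize(ttl=3600, clock=time.monotonic):
    """Cache results per positional-argument tuple; entries expire after `ttl` seconds."""

    def decorator(fn):
        store = {}  # args -> (timestamp, value)

        @wraps(fn)
        def wrapper(*args):
            now = clock()
            entry = store.get(args)
            if entry is not None and now - entry[0] < ttl:
                wrapper.hits += 1
                return entry[1]
            wrapper.misses += 1
            value = fn(*args)
            store[args] = (now, value)
            return value

        wrapper.hits = 0
        wrapper.misses = 0
        return wrapper

    return decorator


fake_now = [0.0]  # controllable clock for the demonstration


@memoize(ttl=10, clock=lambda: fake_now[0])
def shout(text):
    return text.upper()


shout("best")       # miss: computed
shout("best")       # hit: served from cache
fake_now[0] = 11.0  # move past the TTL
shout("best")       # miss: entry expired, recomputed
print(shout.hits, shout.misses)
# 1 2
```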
See Also
- Integrations - deploy pipelines as REST APIs
- Analysis - modules commonly used inside pipelines
- Benchmarks - full performance numbers across modules