Contributing¶
Thanks for wanting to contribute to malaysian-manglish-nlp. This guide covers everything you need to know.
Development Setup¶
Prerequisites¶
- Python 3.10+
- Git
- pip or conda
Clone and install¶
git clone https://github.com/yourusername/malaysian-manglish-nlp.git
cd malaysian-manglish-nlp
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
Verify setup¶
# Run tests
pytest
# Run linters
ruff check .
mypy malaysian_manglish_nlp/
# Check formatting
ruff format --check .
If all three pass, you're good to go.
Code Style¶
We follow a strict but reasonable style:
| Tool | Purpose | Config |
|---|---|---|
| ruff | Linting + formatting | pyproject.toml |
| mypy | Type checking (strict) | mypy.ini |
| pytest | Testing | pytest.ini |
Rules¶
- Type hints everywhere. All public functions must have full type annotations.
- Docstrings. Google-style docstrings for all public modules, classes, and functions.
- No
import *. Explicit imports only. - Line length. 100 characters max.
- Naming.
snake_casefunctions/variables,PascalCaseclasses,UPPER_SNAKEconstants. - Error handling. Raise typed exceptions (
ManglishNLPErrorsubclasses), never bareexcept. - No print statements. Use
loggingmodule for debug output. - Test everything. Every public function needs at least 3 test cases (happy path, edge case, error).
Example¶
# Good
def tokenize(text: str, *, keep_punct: bool = False) -> list[str]:
"""Tokenize text into word list.
Args:
text: Input text to tokenize.
keep_punct: Whether to preserve punctuation tokens.
Returns:
List of word tokens.
Raises:
InputError: If text is empty or not a string.
"""
if not isinstance(text, str) or not text.strip():
raise InputError("text must be a non-empty string")
...
# Bad
def tokenize(text):
tokens = text.split() # no types, no docs, naive impl
return tokens
Adding a New Module¶
Step-by-step guide to adding a new NLP module.
1. Create the module file¶
malaysian_manglish_nlp/
├── your_module.py # Main implementation
├── tests/
│ └── test_your_module.py # Tests
└── docs/
└── your_module.md # Documentation (optional, API ref auto-generates)
2. Implement with standard interface¶
Every module must expose at minimum one callable function:
# malaysian_manglish_nlp/your_module.py
from __future__ import annotations
import logging
from typing import Any
from malaysian_manglish_nlp.exceptions import InputError, ModelError
logger = logging.getLogger(__name__)
def your_function(text: str, **kwargs: Any) -> dict[str, Any]:
"""One-line description.
Args:
text: Input text.
**kwargs: Additional options.
Returns:
Result dict with at least a 'score' or 'label' key.
Raises:
InputError: If input is invalid.
ModelError: If model inference fails.
"""
if not text or not text.strip():
raise InputError("text cannot be empty")
# Your logic here
result = _run_model(text, **kwargs)
return result
def _run_model(text: str, **kwargs: Any) -> dict[str, Any]:
"""Internal model inference. Private function."""
...
3. Register in __init__.py¶
# malaysian_manglish_nlp/__init__.py
from malaysian_manglish_nlp.your_module import your_function
__all__ = [
# ... existing exports
"your_function",
]
4. Write tests¶
# tests/test_your_module.py
import pytest
from malaysian_manglish_nlp import your_function
from malaysian_manglish_nlp.exceptions import InputError
class TestYourFunction:
def test_basic(self):
result = your_function("test input")
assert isinstance(result, dict)
assert "score" in result
def test_manglish_input(self):
result = your_function("Best gila lah")
assert result["score"] > 0
def test_empty_input_raises(self):
with pytest.raises(InputError):
your_function("")
def test_chinese_mixed(self):
result = your_function("这个 sangat bagus")
assert result is not None
# Add edge cases specific to your module
5. Add benchmarks (optional but recommended)¶
# benchmarks/bench_your_module.py
from malaysian_manglish_nlp import your_function
from benchmarks.utils import timed_run
def benchmark():
texts = load_test_corpus()
latency, throughput = timed_run(your_function, texts)
print(f"Latency: {latency:.2f}ms | Throughput: {throughput:.0f} tps")
if __name__ == "__main__":
benchmark()
6. Update documentation¶
Add your function to docs/api-reference.md following the existing format.
Adding Tests¶
Test structure¶
tests/
├── conftest.py # Shared fixtures
├── test_sentiment.py # One file per module
├── test_ner.py
├── test_pipeline.py
├── test_exceptions.py
└── integration/
├── test_full_pipeline.py
└── test_model_loading.py
Test requirements¶
- Minimum 3 tests per function: happy path, edge case, error case
- Manglish-specific inputs: Always test with real Manglish text
- Edge cases: Empty strings, very long text (>10k chars), special characters, mixed scripts
- Regression tests: When fixing a bug, add a test that would have caught it
Running tests¶
# All tests
pytest
# Single module
pytest tests/test_sentiment.py -v
# With coverage
pytest --cov=malaysian_manglish_nlp --cov-report=term-missing
# Parallel (faster for large suites)
pytest -n auto
Coverage target¶
We aim for 90%+ line coverage on all public modules. Check current coverage:
Pull Request Process¶
Before you submit¶
- Run full test suite -
pytestmust pass - Run linters -
ruff check .andmypy malaysian_manglish_nlp/must be clean - Format code -
ruff format . - Update docs - If you changed public API
- Add changelog entry - Describe your change in
docs/changelog.mdunder[Unreleased]
PR checklist¶
- [ ] Tests pass locally
- [ ] Linters pass
- [ ] New tests added for new functionality
- [ ] Docstrings updated
- [ ] API reference updated (if public API changed)
- [ ] Changelog entry added
- [ ] No hardcoded secrets or credentials
- [ ] Backward compatible (or migration path documented)
PR naming convention¶
type: short description
Examples:
feat: add emotion detection module
fix: sentiment misclassifies negated sarcasm
perf: 2x faster tokenization with pre-compiled regex
docs: improve normalize() examples
test: add edge cases for code-switching detection
refactor: simplify pipeline step chaining
Review process¶
- Auto-checks: CI runs tests + linters on push
- Review: At least 1 maintainer approval required
- Squash merge: All PRs squashed to single commit
- Release: Maintainers handle versioning and publishing
Reporting Bugs¶
Open a GitHub issue with:
## Bug Report
**Module**: `sentiment` / `ner` / etc.
**Version**: `malaysian-manglish-nlp==3.0.0`
**Python**: 3.11.7
**OS**: Windows 11 / Ubuntu 22.04 / macOS 14
### What happened
<clear description>
### Expected behavior
<what should have happened>
### Reproduction
```python
from malaysian_manglish_nlp import sentiment
sentiment("this caused the bug")
# Error: ...
Input that triggered it¶
Full traceback¶
Alternatives considered¶
Community¶
- GitHub Issues: Bug reports, feature requests
- Discussions: Questions, ideas, showcases
- Discord: Real-time chat with maintainers and users
Be respectful. We're all building something for the Malaysian NLP community.