Indonesian Named Entity Recognition (NER) Model
Model Description
This is a custom Indonesian Named Entity Recognition (NER) model built with spaCy v3.8+. It identifies and classifies 19 types of named entities in Indonesian text, making it suitable for a wide range of Indonesian-language NLP applications.
Model Details
- Model Name: ner_spacy_indonesian
- Version: 1.1.0
- Language: Indonesian (id)
- License: CC BY-SA 3.0
- Author: Asep Muhamad
- Email: [email protected]
- Website: https://asmud.me
- spaCy Version: >=3.8.0,<3.9.0
Architecture
- Pipeline Components: NER (Named Entity Recognition), Sentence Segmentation (disabled by default; see the snippet after this list for how to re-enable it)
- Architecture: TransitionBasedParser with HashEmbedCNN token-to-vector model
- Token-to-Vector: HashEmbedCNN with 96-dimensional embeddings, 4-layer depth
- Hidden Width: 64 dimensions
- Training: Trained on Universal Dependencies v2.8 datasets
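A quick way to check these architecture settings on the installed package, and to switch the sentence segmenter back on, is to inspect the pipeline from Python. The names used below (a component called `senter`, a nested `tok2vec` block under the `ner` component) follow spaCy defaults and are assumptions rather than details confirmed by this card:

```python
import spacy

nlp = spacy.load("id_ner_spacy_indonesian")

# Inspect the NER architecture recorded in the packaged config
ner_model = nlp.config["components"]["ner"]["model"]
print(ner_model["@architectures"])             # e.g. spacy.TransitionBasedParser.v2
print(ner_model["tok2vec"]["@architectures"])  # e.g. spacy.HashEmbedCNN.v2

# Re-enable the sentence segmenter if it ships disabled
if "senter" in nlp.disabled:
    nlp.enable_pipe("senter")
print(nlp.pipe_names)
```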
Entity Labels
The model recognizes 19 entity types (the snippet after the table shows how to list them from the installed pipeline):
| Label | Description |
|---|---|
| CRD | Cardinal numbers |
| DAT | Dates |
| EVT | Events |
| FAC | Facilities |
| GPE | Geopolitical entities (countries, cities, states) |
| LAN | Languages |
| LAW | Laws |
| LOC | Locations |
| MON | Money/monetary values |
| NOR | Norms |
| ORD | Ordinal numbers |
| ORG | Organizations |
| PER | Persons |
| PRC | Processes |
| PRD | Products |
| QTY | Quantities |
| REG | Regions |
| TIM | Time |
| WOA | Works of art |
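To confirm the label set of the installed pipeline, you can read the labels directly from the NER component (assuming it uses spaCy's default component name `ner`):

```python
import spacy

nlp = spacy.load("id_ner_spacy_indonesian")

# Print the entity labels the NER component was trained with
ner = nlp.get_pipe("ner")
print(sorted(ner.labels))
```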
Performance
The model achieves strong results on both token-level and sentence-level evaluation:
- Token Accuracy: 98.59%
- Token Precision: 95.31%
- Token Recall: 95.72%
- Token F1-Score: 95.52%
- Sentence Precision: 90.67%
- Sentence Recall: 81.49%
- Sentence F1-Score: 85.83%
- Processing Speed: 66,612 tokens/second
Performance metrics are based on evaluation using Universal Dependencies v2.8 datasets with spaCy's standard evaluation framework.
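The figures above are the author's reported results. As a rough illustration of how spaCy's standard evaluation works, you can score the model on your own gold-annotated examples; the sentence and character offsets below are made up for illustration:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("id_ner_spacy_indonesian")

# A single hypothetical gold-annotated sentence
text = "Joko Widodo mengunjungi Jakarta."
gold = {"entities": [(0, 11, "PER"), (24, 31, "GPE")]}

example = Example.from_dict(nlp.make_doc(text), gold)
scores = nlp.evaluate([example])
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```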
Installation
You can install this model directly from the wheel file:

```bash
pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
```

Or download the wheel and install it locally:

```bash
pip install id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
```
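After installation, a quick load confirms the package is importable and reports the version stored in its metadata:

```python
import spacy

# Verify the model loads and check its packaged metadata
nlp = spacy.load("id_ner_spacy_indonesian")
print(nlp.meta["name"], nlp.meta["version"])  # expected: ner_spacy_indonesian 1.1.0
```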
Usage
Basic Usage
```python
import spacy

# Load the model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<10} {ent.start_char}-{ent.end_char}")
```
Advanced Usage
```python
import spacy
from spacy import displacy

# Load model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = """
Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun
pada tahun 2022. CEO BCA, Jahja Setiaatmadja, menyatakan bahwa kinerja
perseroan tetap solid di tengah tantangan ekonomi global.
"""
doc = nlp(text)

# Print detailed entity information
for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Position: {ent.start_char}-{ent.end_char}")
    # No confidence extension is registered by default, so this prints "N/A"
    print(f"Confidence: {ent._.score if hasattr(ent._, 'score') else 'N/A'}")
    print("-" * 50)

# Visualize entities (in a Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)
```
Batch Processing
```python
import spacy

nlp = spacy.load("id_ner_spacy_indonesian")

# Process multiple texts
texts = [
    "PT Telkom Indonesia adalah perusahaan telekomunikasi terbesar di Indonesia.",
    "Universitas Indonesia terletak di Depok, Jawa Barat.",
    "Presiden Susilo Bambang Yudhoyono menjabat dari tahun 2004 hingga 2014."
]

# Batch processing for efficiency
docs = list(nlp.pipe(texts))

for i, doc in enumerate(docs):
    print(f"Text {i+1} entities:")
    for ent in doc.ents:
        print(f"  {ent.text} ({ent.label_})")
    print()
```
Model Training
This model was trained with the following setup (a minimal fine-tuning sketch follows the list):
- Data Source: Universal Dependencies v2.8 (multiple language datasets including Indonesian)
- Training Framework: spaCy v3.8+
- Optimization: Adam optimizer with gradient clipping
- Batch Size: Dynamic batching (100-1000 words)
- Training Steps: 100,000 maximum steps
- Dropout: 0.1
- Evaluation Frequency: Every 1,000 steps
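The pipeline itself was trained with spaCy's training workflow using the settings above. As a minimal sketch only, the snippet below shows how further fine-tuning on your own annotated sentences could look, reusing the reported Adam optimizer and 0.1 dropout; the example data, offsets, and batch schedule are illustrative assumptions, not the author's recipe:

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding

nlp = spacy.load("id_ner_spacy_indonesian")
optimizer = nlp.resume_training()  # Adam by default in spaCy

# Hypothetical fine-tuning data: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("PT Telkom Indonesia berpusat di Bandung.",
     {"entities": [(0, 19, "ORG"), (32, 39, "GPE")]}),
]

for epoch in range(3):
    random.shuffle(TRAIN_DATA)
    # Small compounding batches, loosely mirroring the dynamic batching above
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
        nlp.update(examples, sgd=optimizer, drop=0.1)
```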
Limitations
- The model is primarily trained on formal Indonesian text and may have reduced performance on informal or colloquial Indonesian
- Performance may vary on domain-specific texts not well represented in the training data
- Some entity boundaries might not be perfect, especially for complex compound entities
Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{muhamad2024indonesian_ner,
  title   = {Indonesian Named Entity Recognition Model},
  author  = {Muhamad, Asep},
  year    = {2024},
  version = {1.1.0},
  url     = {https://huggingface.co/asmud/ner-spacy-indonesian},
  license = {CC BY-SA 3.0}
}
```
Contact
For questions, issues, or collaborations:
- Author: Asep Muhamad
- Email: [email protected]
- Website: https://asmud.me
Acknowledgments
This model was trained using data from Universal Dependencies v2.8, contributed by Daniel Zeman, Joakim Nivre, Mitchell Abrams, and many other contributors. Special thanks to the spaCy team for providing an excellent framework for natural language processing.