Indonesian Named Entity Recognition (NER) Model

Model Description

This is a custom Indonesian Named Entity Recognition (NER) model built with spaCy v3.8 (compatible with spaCy >=3.8.0,<3.9.0). It identifies and classifies 19 types of named entities in Indonesian text.

Model Details

Model Name: ner_spacy_indonesian
Version: 1.1.0
Language: Indonesian (id)
License: CC BY-SA 3.0
Author: Asep Muhamad
Email: [email protected]
Website: https://asmud.me
spaCy Version: >=3.8.0,<3.9.0

Architecture

Pipeline Components: NER (Named Entity Recognition), Sentence Segmentation (disabled by default)
Architecture: TransitionBasedParser with HashEmbedCNN token-to-vector model
Token-to-Vector: HashEmbedCNN with 96-dimensional embeddings, 4-layer depth
Hidden Width: 64 dimensions
Training: Trained on Universal Dependencies v2.8 datasets
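
Because the sentence segmenter ships disabled, it has to be enabled explicitly before sentence boundaries are set on a document. The sketch below is a small helper for that, built on spaCy's `nlp.disabled` and `nlp.enable_pipe`. The component name `senter` is an assumption (it is spaCy's usual name for its sentence recognizer, but this card does not state the name used here), and `enable_disabled_pipe` is an illustrative helper, not part of the package.

```python
def enable_disabled_pipe(nlp, name="senter"):
    """Enable a pipeline component if it is currently disabled.

    `name` defaults to "senter", which is an assumption: inspect
    `nlp.disabled` to see what this model calls its segmenter.
    Returns True if this call enabled the component, else False.
    """
    if name in nlp.disabled:
        nlp.enable_pipe(name)
        return True
    return False


# Typical use, once the model is installed:
#   import spacy
#   nlp = spacy.load("id_ner_spacy_indonesian")
#   print(nlp.pipe_names, nlp.disabled)  # check the actual component names
#   enable_disabled_pipe(nlp)
```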

Entity Labels

The model recognizes 19 different entity types:

Label	Description
CRD	Cardinal numbers
DAT	Dates
EVT	Events
FAC	Facilities
GPE	Geopolitical entities (countries, cities, states)
LAN	Languages
LAW	Laws
LOC	Locations
MON	Money/monetary values
NOR	Norms
ORD	Ordinal numbers
ORG	Organizations
PER	Persons
PRC	Processes
PRD	Products
QTY	Quantities
REG	Regions
TIM	Time
WOA	Works of art
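
For downstream code it can be convenient to have the table above available as a lookup. This is a small sketch; `ENTITY_LABELS` and `describe_label` are illustrative names and are not shipped with the model.

```python
# Label descriptions copied from the table above.
ENTITY_LABELS = {
    "CRD": "Cardinal numbers",
    "DAT": "Dates",
    "EVT": "Events",
    "FAC": "Facilities",
    "GPE": "Geopolitical entities (countries, cities, states)",
    "LAN": "Languages",
    "LAW": "Laws",
    "LOC": "Locations",
    "MON": "Money/monetary values",
    "NOR": "Norms",
    "ORD": "Ordinal numbers",
    "ORG": "Organizations",
    "PER": "Persons",
    "PRC": "Processes",
    "PRD": "Products",
    "QTY": "Quantities",
    "REG": "Regions",
    "TIM": "Time",
    "WOA": "Works of art",
}


def describe_label(label):
    """Return the human-readable description for an entity label."""
    return ENTITY_LABELS.get(label, "Unknown label")
```

With a loaded pipeline, `describe_label(ent.label_)` can be used when printing or exporting entities.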

Performance

The model reports the following token-level and sentence-level scores:

Token Accuracy: 98.59%
Token Precision: 95.31%
Token Recall: 95.72%
Token F1-Score: 95.52%
Sentence Precision: 90.67%
Sentence Recall: 81.49%
Sentence F1-Score: 85.83%
Processing Speed: 66,612 tokens/second

Performance metrics are based on evaluation using Universal Dependencies v2.8 datasets with spaCy's standard evaluation framework.
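
As a sanity check, the F1 scores above can be recomputed from the listed precision and recall using the standard harmonic-mean formula. The sketch below does that. Because the published precision and recall are already rounded, the recomputed values may differ from the published F1 in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the list above.
token_f1 = f1_score(95.31, 95.72)
sentence_f1 = f1_score(90.67, 81.49)

print(f"Token F1 (recomputed):    {token_f1:.2f}")
print(f"Sentence F1 (recomputed): {sentence_f1:.2f}")
```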

Installation

You can install this model directly from the wheel file:

pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/id_ner_spacy_indonesian-1.1.0-py3-none-any.whl

Or download and install locally:

pip install id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
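
After installing, it can help to confirm which versions ended up in the environment, since the model expects spaCy >=3.8.0,<3.9.0. This is a minimal sketch using only the standard library. The distribution name `id_ner_spacy_indonesian` is inferred from the wheel filename and should be treated as an assumption.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names: "spacy" is the framework; the model's name is
# assumed from the wheel filename above.
for dist in ("spacy", "id_ner_spacy_indonesian"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```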

Usage

Basic Usage

import spacy

# Load the model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<10} {ent.start_char}-{ent.end_char}")
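
If the entities are needed as data rather than printed output, the same attributes can be collected into plain dictionaries. The helpers below are an illustrative sketch (the names `entities_to_records` and `group_by_label` are not part of spaCy or this model). They only rely on the `text`, `label_`, `start_char` and `end_char` attributes used above.

```python
from collections import defaultdict


def entities_to_records(ents):
    """Convert spaCy-style entity spans into JSON-friendly dictionaries."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in ents
    ]


def group_by_label(records):
    """Group entity texts by label, preserving document order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["label"]].append(record["text"])
    return dict(grouped)


# Typical use with the `doc` from the example above:
#   records = entities_to_records(doc.ents)
#   print(group_by_label(records))
```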

Advanced Usage

import spacy
from spacy import displacy

# Load model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = """
Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun 
pada tahun 2022. CEO BCA, Jahja Setiaatmadja, menyatakan bahwa kinerja 
perseroan tetap solid di tengah tantangan ekonomi global.
"""

doc = nlp(text)

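# Print detailed entity information
# Note: spaCy's NER component does not attach a confidence score by default,
# so "Confidence" prints N/A unless a custom `score` extension is registered.
for ent in doc.ents:
    score = ent._.score if ent.has_extension("score") else "N/A"
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Position: {ent.start_char}-{ent.end_char}")
    print(f"Confidence: {score}")
    print("-" * 50)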

# Visualize entities (in Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)

Batch Processing

import spacy

nlp = spacy.load("id_ner_spacy_indonesian")

# Process multiple texts
texts = [
    "PT Telkom Indonesia adalah perusahaan telekomunikasi terbesar di Indonesia.",
    "Universitas Indonesia terletak di Depok, Jawa Barat.",
    "Presiden Susilo Bambang Yudhoyono menjabat dari tahun 2004 hingga 2014."
]

# Batch processing for efficiency
docs = list(nlp.pipe(texts))

for i, doc in enumerate(docs):
    print(f"Text {i+1} entities:")
    for ent in doc.ents:
        print(f"  {ent.text} ({ent.label_})")
    print()
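
When texts come with identifiers (database keys, file names), `nlp.pipe` can carry them through with `as_tuples=True`, which takes `(text, context)` pairs and yields `(doc, context)` pairs. The sketch below assumes that pattern. The helper name `extract_with_ids` and the default `batch_size=64` are illustrative choices, not recommendations from the model author.

```python
def extract_with_ids(nlp, id_text_pairs, batch_size=64):
    """Run nlp.pipe over (id, text) pairs and map each id to its entities.

    Returns a dict of {id: [(entity_text, entity_label), ...]}.
    """
    # nlp.pipe(as_tuples=True) expects (text, context), so swap the order.
    stream = ((text, doc_id) for doc_id, text in id_text_pairs)
    results = {}
    for doc, doc_id in nlp.pipe(stream, as_tuples=True, batch_size=batch_size):
        results[doc_id] = [(ent.text, ent.label_) for ent in doc.ents]
    return results


# Typical use with the `nlp` and `texts` from the example above:
#   pairs = [(f"text-{i}", t) for i, t in enumerate(texts, start=1)]
#   print(extract_with_ids(nlp, pairs))
```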

Model Training

This model was trained using:

Data Source: Universal Dependencies v2.8 (multiple language datasets including Indonesian)
Training Framework: spaCy v3.8 (>=3.8.0,<3.9.0)
Optimization: Adam optimizer with gradient clipping
Batch Size: Dynamic batching (100-1000 words)
Training Steps: 100,000 maximum steps
Dropout: 0.1
Evaluation Frequency: Every 1,000 steps
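
"Dynamic batching (100-1000 words)" most likely refers to a batch size that grows during training, as in spaCy's compounding schedule. The sketch below illustrates that idea in plain Python. The growth factor of 1.001 is an assumption taken from spaCy's common default configuration, not a value stated in this card, and `batch_size_at` is a hypothetical helper rather than a spaCy function.

```python
def batch_size_at(step, start=100.0, stop=1000.0, compound=1.001):
    """Approximate batch size (in words) at a given training step.

    The size starts at `start`, is multiplied by `compound` each step,
    and is capped at `stop`. `compound=1.001` is an assumed default.
    """
    return min(start * compound ** step, stop)


# Show how the word budget per batch grows until it reaches the cap.
for step in (0, 500, 1000, 2000, 3000):
    print(f"step {step:>5}: ~{batch_size_at(step):.0f} words per batch")
```

Under these assumed values the cap is reached after roughly 2,300 steps, so most of a 100,000-step run would use the maximum batch size.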

Limitations

The model is primarily trained on formal Indonesian text and may have reduced performance on informal or colloquial Indonesian
Performance may vary on domain-specific texts not well represented in the training data
Some entity boundaries might not be perfect, especially for complex compound entities

Citation

If you use this model in your research or applications, please cite:

@misc{muhamad2024indonesian_ner,
  title={Indonesian Named Entity Recognition Model},
  author={Muhamad, Asep},
  year={2024},
  version={1.1.0},
  url={https://huggingface.co/asmud/ner-spacy-indonesian},
  license={CC BY-SA 3.0}
}

Contact

For questions, issues, or collaborations:

Author: Asep Muhamad
Email: [email protected]
Website: https://asmud.me

Acknowledgments

This model was trained using data from Universal Dependencies v2.8, contributed by Daniel Zeman, Joakim Nivre, Mitchell Abrams, and many other contributors. Special thanks to the spaCy team for providing an excellent framework for natural language processing.
