Indonesian Named Entity Recognition (NER) Model

Model Description

This is a custom Indonesian Named Entity Recognition (NER) model built with spaCy 3.8. It identifies and classifies 19 types of named entities in Indonesian text and is intended for NLP applications that need entity extraction from Indonesian-language input.

Model Details

  • Model Name: ner_spacy_indonesian
  • Version: 1.1.0
  • Language: Indonesian (id)
  • License: CC BY-SA 3.0
  • Author: Asep Muhamad
  • Email: [email protected]
  • Website: https://asmud.me
  • spaCy Version: >=3.8.0,<3.9.0

Architecture

  • Pipeline Components: NER (Named Entity Recognition), Sentence Segmentation (disabled by default)
  • Architecture: TransitionBasedParser with HashEmbedCNN token-to-vector model
  • Token-to-Vector: HashEmbedCNN with 96-dimensional embeddings, 4-layer depth
  • Hidden Width: 64 dimensions
  • Training: Trained on Universal Dependencies v2.8 datasets

Entity Labels

The model recognizes 19 different entity types:

Label  Description
CRD    Cardinal numbers
DAT    Dates
EVT    Events
FAC    Facilities
GPE    Geopolitical entities (countries, cities, states)
LAN    Languages
LAW    Laws
LOC    Locations
MON    Money/monetary values
NOR    Norms
ORD    Ordinal numbers
ORG    Organizations
PER    Persons
PRC    Processes
PRD    Products
QTY    Quantities
REG    Regions
TIM    Time
WOA    Works of art
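
For downstream filtering it can help to keep the label set in code. The sketch below copies the labels above into a dictionary and groups (text, label) pairs by label. The names `ENTITY_LABELS` and `group_by_label` are illustrative and not part of the model package, and the example pairs are hypothetical rather than actual model output.

```python
from collections import defaultdict

# Label descriptions copied from the entity label list in this model card.
ENTITY_LABELS = {
    "CRD": "Cardinal numbers",
    "DAT": "Dates",
    "EVT": "Events",
    "FAC": "Facilities",
    "GPE": "Geopolitical entities (countries, cities, states)",
    "LAN": "Languages",
    "LAW": "Laws",
    "LOC": "Locations",
    "MON": "Money/monetary values",
    "NOR": "Norms",
    "ORD": "Ordinal numbers",
    "ORG": "Organizations",
    "PER": "Persons",
    "PRC": "Processes",
    "PRD": "Products",
    "QTY": "Quantities",
    "REG": "Regions",
    "TIM": "Time",
    "WOA": "Works of art",
}


def group_by_label(pairs):
    """Group (text, label) pairs by label, skipping labels not in ENTITY_LABELS."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if label in ENTITY_LABELS:
            grouped[label].append(text)
    return dict(grouped)


# Hypothetical pairs for illustration. With a loaded pipeline they would come from
#   [(ent.text, ent.label_) for ent in doc.ents]
example = [("Joko Widodo", "PER"), ("Jakarta", "GPE"), ("Depok", "GPE")]
print(group_by_label(example))
```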

Performance

Reported evaluation results at the token and sentence level, along with processing speed:

  • Token Accuracy: 98.59%
  • Token Precision: 95.31%
  • Token Recall: 95.72%
  • Token F1-Score: 95.52%
  • Sentence Precision: 90.67%
  • Sentence Recall: 81.49%
  • Sentence F1-Score: 85.83%
  • Processing Speed: 66,612 tokens/second

Performance metrics are based on evaluation using Universal Dependencies v2.8 datasets with spaCy's standard evaluation framework.
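
As a quick consistency check, each F1 score should be the harmonic mean of the corresponding precision and recall. The sketch below recomputes both from the figures listed above. Small differences (around 0.01) are expected because the published precision and recall are already rounded. The function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (percent in, percent out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the list above.
token_f1 = f1_score(95.31, 95.72)
sentence_f1 = f1_score(90.67, 81.49)

print(f"Token F1 (recomputed):    {token_f1:.2f}  (reported: 95.52)")
print(f"Sentence F1 (recomputed): {sentence_f1:.2f}  (reported: 85.83)")
```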

Installation

You can install this model directly from the wheel file:

pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/id_ner_spacy_indonesian-1.1.0-py3-none-any.whl

Or download and install locally:

pip install id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
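
To confirm the install succeeded before loading the pipeline, the package metadata can be queried with the standard library. This is a minimal sketch; the distribution name `id_ner_spacy_indonesian` is assumed from the wheel filename above, and `installed_version` is an illustrative helper.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name assumed from the wheel filename.
version = installed_version("id_ner_spacy_indonesian")
print(version if version else "model package not installed")
```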

Usage

Basic Usage

import spacy

# Load the model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<10} {ent.start_char}-{ent.end_char}")

Advanced Usage

import spacy
from spacy import displacy

# Load model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = """
Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun 
pada tahun 2022. CEO BCA, Jahja Setiaatmadja, menyatakan bahwa kinerja 
perseroan tetap solid di tengah tantangan ekonomi global.
"""

doc = nlp(text)

# Print detailed entity information
for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Position: {ent.start_char}-{ent.end_char}")
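    # spaCy's built-in NER does not attach confidence scores, so this prints
    # "N/A" unless a custom `score` extension has been registered on Span.
    print(f"Confidence: {ent._.score if hasattr(ent._, 'score') else 'N/A'}")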
    print("-" * 50)

# Visualize entities (in Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)

Batch Processing

import spacy

nlp = spacy.load("id_ner_spacy_indonesian")

# Process multiple texts
texts = [
    "PT Telkom Indonesia adalah perusahaan telekomunikasi terbesar di Indonesia.",
    "Universitas Indonesia terletak di Depok, Jawa Barat.",
    "Presiden Susilo Bambang Yudhoyono menjabat dari tahun 2004 hingga 2014."
]

# Batch processing for efficiency
docs = list(nlp.pipe(texts))

for i, doc in enumerate(docs):
    print(f"Text {i+1} entities:")
    for ent in doc.ents:
        print(f"  {ent.text} ({ent.label_})")
    print()
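
When batch results need to be stored or passed to another system, entities can be flattened into plain dictionaries. The sketch below relies only on the `text`, `label_`, `start_char`, and `end_char` attributes that spaCy spans expose. The helper name `ents_to_records` and the stand-in document are hypothetical, used so the example runs without the model installed.

```python
import json
from types import SimpleNamespace


def ents_to_records(doc):
    """Convert doc.ents into plain dictionaries that json.dumps can serialize."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


# Hypothetical stand-in for a spaCy Doc, not real model output.
# With the real pipeline, pass each doc from nlp.pipe(texts) instead.
fake_doc = SimpleNamespace(
    ents=[SimpleNamespace(text="Depok", label_="GPE", start_char=34, end_char=39)]
)
print(json.dumps(ents_to_records(fake_doc), ensure_ascii=False))
```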

Model Training

This model was trained using:

  • Data Source: Universal Dependencies v2.8 (multiple language datasets including Indonesian)
  • Training Framework: spaCy v3.8+
  • Optimization: Adam optimizer with gradient clipping
  • Batch Size: Dynamic batching (100-1000 words)
  • Training Steps: 100,000 maximum steps
  • Dropout: 0.1
  • Evaluation Frequency: Every 1,000 steps
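
Dynamic batching over a 100-1000 word range is commonly configured in spaCy with a compounding schedule: the batch size starts small and is multiplied by a constant factor at each step until it reaches a ceiling. Whether this model used such a schedule, and with what factor, is not stated here. The sketch below reimplements the idea in plain Python rather than calling spaCy, and the factor 1.001 is an assumption chosen for illustration.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start*compound, ... while never exceeding stop."""
    value = float(start)
    stop = float(stop)
    while True:
        yield value
        value = min(value * compound, stop)


# start=100 and stop=1000 come from the batch size range listed above;
# compound=1.001 is an assumed growth factor, not a documented setting.
sizes = list(islice(compounding(100, 1000, 1.001), 3000))
print(f"first batch: {sizes[0]:.0f} words, last batch: {sizes[-1]:.0f} words")
```

Under these assumptions the batch size grows slowly at first and is capped at the ceiling for the remainder of training.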

Limitations

  • The model is primarily trained on formal Indonesian text and may have reduced performance on informal or colloquial Indonesian
  • Performance may vary on domain-specific texts not well represented in the training data
  • Some entity boundaries might not be perfect, especially for complex compound entities

Citation

If you use this model in your research or applications, please cite:

@misc{muhamad2024indonesian_ner,
  title={Indonesian Named Entity Recognition Model},
  author={Muhamad, Asep},
  year={2024},
  version={1.1.0},
  url={https://huggingface.co/asmud/ner-spacy-indonesian},
  license={CC BY-SA 3.0}
}

Contact

For questions, issues, or collaborations, reach out through the website listed under Model Details (https://asmud.me).

Acknowledgments

This model was trained using data from Universal Dependencies v2.8, contributed by Daniel Zeman, Joakim Nivre, Mitchell Abrams, and many other contributors. Special thanks to the spaCy team for providing an excellent framework for natural language processing.
