---
datasets:
- pythainlp/thainer-corpus-v2
language:
- th
base_model:
- clicknext/phayathaibert
pipeline_tag: token-classification
library_name: transformers
tags:
- medical
---
# No Name Thai NER

<img src="mascot-image-landscape.png" alt="mascot" style="width: 600px; height: auto; display: block; margin: 0 auto;">
<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin-bottom: 20px;">
    <img src="Looloohealth.png" alt="Looloo Health" style="width: 250px; height: auto;">
    <img src="PresScribe.png" alt="Prescribe" style="width: 250px; height: auto;">
</div>


A compact Thai token-classification model optimized for fast named-entity recognition (NER) and practical de-identification of medical text. This checkpoint was trained to detect entities in Thai clinical and conversational text and is intended for use in context-preserving anonymization pipelines.

At [**Looloo Health**](https://looloohealth.com/en/), we're passionate about making healthcare more accessible and affordable for everyone.
We believe that unlocking the potential of clinical data is key to this goal, and we're excited to share our work with the community.
This model is a core component of our AI Medical Scribe, [**PresScribe**](https://www.youtube.com/watch?v=oUiJ9oPgZMA), where it helps protect patient privacy through automated de-identification.


**Features**
- Detects common sensitive entity types found in medical text (names, phone numbers, IDs, addresses, dates, etc.).
- Lightweight and fast to run on **CPUs** with the Hugging Face `transformers` pipeline.
- Designed to be used as part of a de-identification workflow (post-processing is recommended to merge token-level predictions into full spans).
- Trained on a **comprehensive synthetic dataset of over 300,000 samples**, which is intended to improve robustness and generalization.
- On our internal test set, the model achieved over 95% accuracy for our specific use case.


**Supported entity labels**
- PERSON
- PHONE
- EMAIL
- ADDRESS (sometimes labelled as LOCATION)
- DATE
- NATIONAL_ID
- HOSPITAL_IDS

## Quick start

Install minimal dependencies:

```
pip install -U transformers torch
```

Load and run the model with Hugging Face pipelines:

```python
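from transformers import pipeline

# device=-1 runs inference on the CPU
ner = pipeline("token-classification", model="loolootech/no-name-ner-th", device=-1)

# Sample Thai doctor-patient exchange that mentions person names (สมชาย, มาร์ค)
text = "คุณสมชายเป็นอะไรมาครับวันนี้ อ๋อวันนี้ปวดตับครับ งั้นวันนี้หมอขอตรวจละเอียดหน่อยนะ ได้เลยครับน้องมาร์ค"

# Each result is a token-level prediction with its label, score, and character offsets
results = ner(text)
print(results)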
```

Notes on post-processing (more details in our [example notebook](https://github.com/loolootech/no-name-ner-th/blob/main/example.ipynb)):
- The pipeline returns token-level predictions (B-/I- style). For redaction or anonymization, merge adjacent tokens with the same label into full spans before replacing them with entity-specific redaction tokens (e.g. [PERSON], [PHONE]).
- When redacting, replace spans from right-to-left or rebuild the output string from slices to avoid offset shifts.
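
The two notes above can be sketched as follows. This is a minimal illustration, not the code from the example notebook: `merge_token_spans` and `redact` are hypothetical helper names, and `mock_predictions` is hand-written input that only mimics the `entity`, `start`, and `end` keys the `transformers` token-classification pipeline returns when no aggregation is applied. The sketch assumes labels follow the `B-`/`I-` pattern and, for simplicity, merges any touching tokens of the same entity type regardless of prefix.

```python
from typing import Dict, List


def merge_token_spans(predictions: List[Dict]) -> List[Dict]:
    """Merge touching tokens that share an entity type into character spans."""
    spans: List[Dict] = []
    for tok in sorted(predictions, key=lambda t: t["start"]):
        # "B-PERSON" / "I-PERSON" -> "PERSON"; unprefixed labels pass through.
        label = tok["entity"].split("-", 1)[-1]
        if label == "O":
            continue
        last = spans[-1] if spans else None
        # Assumption: tokens belong together when they touch or overlap.
        if last and last["label"] == label and tok["start"] <= last["end"]:
            last["end"] = max(last["end"], tok["end"])
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return spans


def redact(text: str, spans: List[Dict]) -> str:
    """Replace each span with [LABEL], right-to-left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


# Hand-written mock predictions (NOT real model output), with offsets
# computed from the text so they are guaranteed to line up.
text = "คุณสมชาย โทร 0812345678"
name, phone = "สมชาย", "0812345678"
name_start, phone_start = text.index(name), text.index(phone)
mock_predictions = [
    {"entity": "B-PERSON", "start": name_start, "end": name_start + 2},
    {"entity": "I-PERSON", "start": name_start + 2, "end": name_start + len(name)},
    {"entity": "B-PHONE", "start": phone_start, "end": phone_start + len(phone)},
]

spans = merge_token_spans(mock_predictions)
redacted = redact(text, spans)
print(redacted)  # คุณ[PERSON] โทร [PHONE]
```

To apply this to real output, pass the list returned by `ner(text)` to `merge_token_spans`. The `transformers` pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities directly; whether that grouping is sufficient for this model is worth checking on your own data.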


## Disclaimer

* This model is intended as an assistive tool for de-identification. It is not a substitute for professional, legal, or medical advice.

* Users are fully responsible for ensuring compliance with applicable privacy, legal, and regulatory requirements.

* While efforts have been made to improve accuracy, no automated system is 100% reliable. We strongly recommend implementing a regular human review process to validate outputs.


## License
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License ([CC BY-NC 4.0](LICENSE)).

- For commercial use, please contact contact@looloohealth.com.


## Citation

If you use the model, you can cite it with the following BibTeX entry:

```
@misc {no_name_ner_th,
    author       = { Atirut Boribalburephan and Chiraphat Boonnag and Knot Pipatsrisawat },
    title        = { no-name-ner-th },
    year         = 2025,
    url          = { https://huggingface.co/loolootech/no-name-ner-th },
    publisher    = { Hugging Face }
}
```


## Acknowledgement
We extend our gratitude to the `PhayaThaiBERT` team and `Pavarissy/phayathaibert-thainer` for providing the initial checkpoint for our model, which served as a crucial starting point. We also acknowledge PyThaiNLP for their invaluable contribution of the `thainer-corpus-v2` dataset, which was essential for training and evaluation.