bge-m3-HD
Model Description
bge-m3-HD is a multilingual embedding model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Based on BAAI/bge-m3, this variant performs particularly well on Data2txt tasks, achieving the highest F1-score (84.62%) among all evaluated models on that task.
This model leverages multilingual pretraining capabilities to provide strong cross-lingual understanding while being fine-tuned specifically for Turkish hallucination detection. It is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications.
Model Details
- Model Type: Multilingual embedding model for token classification
- Base Model: BAAI/bge-m3
- Language: Turkish (multilingual capabilities)
- Task: Hallucination Detection (Token-Level Binary Classification)
- Framework: LettuceDetect
- Fine-tuned on: RAGTruth-TR dataset
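The detector treats hallucination detection as token-level binary classification over the generated answer, given the retrieved context. The sketch below shows one way such a checkpoint could be loaded with the Hugging Face transformers token-classification API; the label semantics (0 = supported, 1 = hallucinated) and the availability of a token-classification head in this repository are illustrative assumptions, not details confirmed by this card.

```python
# Minimal sketch of token-level binary classification for hallucination detection.
# Assumptions (not confirmed by this card): the checkpoint exposes a
# token-classification head, and label 1 means "hallucinated".
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("newmindai/bge-m3-HD")
model = AutoModelForTokenClassification.from_pretrained("newmindai/bge-m3-HD")

context = "Kaynak belge metni..."
answer = "Üretilen cevap metni..."

# Encode context and answer as a sentence pair; each answer token then receives
# a binary prediction from the classification head.
inputs = tokenizer(context, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1).squeeze(0)   # per-token class ids
```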
Performance Highlights
Example-Level Performance (Whole Dataset)
- F1-Score: 71.12% (strong performance, second only to specialized Turkish models)
- Precision: 78.66% (high precision)
- Recall: 64.90% (moderate recall)
- AUROC: 85.91% (excellent discriminative power)
Task-Specific Performance
Data2txt Task (Exceptional - Best in Class):
- F1-Score: 84.62% (highest among all evaluated models)
- Precision: 90.06% (highest precision)
- Recall: 79.79%
- AUROC: 88.50% (excellent)
QA Task:
- F1-Score: 62.82%
- Precision: 58.29%
- Recall: 68.13%
- AUROC: 86.30%
Summary Task:
- F1-Score: 29.08%
- Precision: 52.56%
- Recall: 20.10%
- AUROC: 69.77%
Token-Level Performance (Whole Dataset)
- F1-Score: 45.46%
- Precision: 55.50%
- Recall: 38.50%
- AUROC: 68.60%
Token-Level Task Performance:
- QA: F1 49.80%, AUROC 72.25%
- Data2txt: F1 51.15%, AUROC 71.81%
- Summary: F1 16.99%, AUROC 55.05%
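For reference, the sketch below illustrates how metrics at the two reporting levels above are typically computed: token-level scores are binary precision/recall/F1 over per-token labels (with AUROC over predicted probabilities), and an example-level label can be derived by flagging a response as hallucinated if any of its tokens is predicted hallucinated. This is a generic illustration, not the project's evaluation script.

```python
# Generic illustration of token-level vs. example-level metric computation;
# not the Turk-LettuceDetect evaluation script.
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Toy per-token labels for two answers: 1 = hallucinated token, 0 = supported.
token_true  = [[0, 0, 1, 1, 0], [0, 0, 0, 0]]
token_pred  = [[0, 1, 1, 0, 0], [0, 0, 0, 0]]
token_score = [[0.1, 0.6, 0.9, 0.4, 0.2], [0.1, 0.2, 0.1, 0.3]]

# Token-level metrics over all tokens pooled together.
flat_true, flat_pred, flat_score = (sum(x, []) for x in (token_true, token_pred, token_score))
p, r, f1, _ = precision_recall_fscore_support(flat_true, flat_pred, average="binary")
auroc = roc_auc_score(flat_true, flat_score)
print(f"token-level   P={p:.2%} R={r:.2%} F1={f1:.2%} AUROC={auroc:.2%}")

# Example-level label: an answer counts as hallucinated if any token is flagged.
ex_true = [int(any(t)) for t in token_true]
ex_pred = [int(any(t)) for t in token_pred]
p, r, f1, _ = precision_recall_fscore_support(ex_true, ex_pred, average="binary")
print(f"example-level P={p:.2%} R={r:.2%} F1={f1:.2%}")
```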
Key Advantages
- Exceptional Data2txt Performance: Best-in-class F1-score (84.62%) and precision (90.06%) for data-to-text tasks
- Strong Discriminative Power: 85.91% AUROC demonstrates excellent ability to distinguish hallucinations
- High Precision: 78.66% precision (whole dataset) suitable for production systems
- Multilingual Foundation: Benefits from multilingual pretraining while specialized for Turkish
- Balanced Performance: Good balance between precision and recall across tasks
Intended Use
This model is designed for:
- Turkish RAG Systems: Detecting hallucinations in generated Turkish text
- Data2txt Applications: Exceptional performance in data-to-text generation scenarios
- Production Deployment: High-precision hallucination detection suitable for real-time systems
- Multilingual Contexts: Applications that may benefit from multilingual understanding
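As a hedged sketch of the production-deployment pattern above, the snippet below runs the detector on every generated answer and withholds responses containing predicted hallucinated spans. The detector API follows the upstream LettuceDetect package; `generate_answer` is a hypothetical stand-in for your RAG generator, and the fallback policy is purely illustrative.

```python
# Guardrail sketch for a Turkish RAG pipeline. The HallucinationDetector API
# follows the upstream LettuceDetect package; generate_answer is a hypothetical
# stand-in for your own RAG generator.
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/bge-m3-HD",
)

def generate_answer(contexts: list[str], question: str) -> str:
    # Placeholder: call your LLM / RAG generator here.
    return "Üretilen cevap metni..."

def answer_with_guardrail(contexts: list[str], question: str) -> str:
    answer = generate_answer(contexts, question)
    spans = detector.predict(
        context=contexts, question=question, answer=answer, output_format="spans"
    )
    if spans:  # any predicted hallucinated span triggers a fallback
        return "Kaynaklardan doğrulanamayan içerik tespit edildi; yanıt gösterilmiyor."
    return answer
```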
Training Data
The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:
- Training Samples: 17,790 examples
- Test Samples: 2,700 examples
- Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
- Annotation: Token-level hallucination labels preserved during translation
Evaluation Data
The model was evaluated on the RAGTruth-TR test set across three task types:
- Summary: 900 examples
- Data2txt: 900 examples
- QA: 900 examples
- Whole Dataset: 2,700 examples
Limitations
- Token-Level Performance: The token-level F1-score (45.46% on the whole dataset) is lower than that of larger specialized models (71-78%)
- Summary Task: Lower performance in summarization tasks (29.08% F1) with low recall (20.10%)
- Model Size: Larger than specialized Turkish models, requiring more computational resources
- Language Focus: Although the base model is multilingual, this variant is optimized specifically for Turkish RAG scenarios
Recommendations
Use this model when:
- Data2txt tasks are the primary use case (84.62% F1, best-in-class)
- High precision is required (78.66% whole dataset, 90.06% Data2txt)
- Multilingual understanding may be beneficial
- Strong discriminative power is needed (85.91% AUROC)
- The deployment is a production system requiring reliable performance
Consider alternatives when:
- Maximum efficiency is critical (use ettin-encoder-32M-TR-HD)
- Summary tasks are the primary use case (consider larger specialized models)
- Maximum accuracy across all tasks is required (use ModernBERT: 78.21% F1)
How to Use
```python
from lettucedetect.models.inference import HallucinationDetector

# Load the model via the LettuceDetect inference API
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/bge-m3-HD",
)

# Detect hallucinations in a generated answer
contexts = ["Your source document text..."]
question = "Your question..."
answer = "Generated answer text..."

predictions = detector.predict(
    context=contexts,
    question=question,
    answer=answer,
    output_format="spans",
)

# Each prediction marks a hallucinated answer span with character offsets and a confidence score
print(predictions)
```
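With `output_format="spans"`, the upstream LettuceDetect API returns the character-level spans of the answer that are predicted to be hallucinated, each with a confidence score; these spans correspond to the token-level predictions reported in the performance sections above.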
Citation
If you use this model, please cite:
```bibtex
@misc{taş2025turklettucedetecthallucinationdetectionmodels,
  title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
  author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
  year={2025},
  eprint={2509.17671},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.17671},
}
```

```bibtex
@misc{chen2024bgem3embeddingmultilingualmultifunctionality,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2402.03216},
}
```
Model Card Contact
For questions or issues, please open an issue on the project repository.