bge-m3-HD
Model Description
bge-m3-HD is a multilingual embedding model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Based on BAAI/bge-m3, this variant performs particularly well on Data2txt tasks, achieving the highest F1-score (84.62%) among all evaluated models on that task.
This model leverages multilingual pretraining capabilities to provide strong cross-lingual understanding while being fine-tuned specifically for Turkish hallucination detection. It is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications.
Model Details
- Model Type: Multilingual embedding model for token classification
- Base Model: BAAI/bge-m3
- Language: Turkish (multilingual capabilities)
- Task: Hallucination Detection (Token-Level Binary Classification)
- Framework: LettuceDetect
- Fine-tuned on: RAGTruth-TR dataset
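The detector treats hallucination detection as token-level binary classification over the generated answer, given the retrieved context. The sketch below shows one way such a checkpoint could be loaded with the Hugging Face transformers token-classification API; the label semantics (0 = supported, 1 = hallucinated) and the availability of a token-classification head in this repository are illustrative assumptions, not details confirmed by this card.

```python
# Minimal sketch of token-level binary classification for hallucination detection.
# Assumptions (not confirmed by this card): the checkpoint exposes a
# token-classification head, and label 1 means "hallucinated".
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("newmindai/bge-m3-HD")
model = AutoModelForTokenClassification.from_pretrained("newmindai/bge-m3-HD")

context = "Kaynak belge metni..."
answer = "Üretilen cevap metni..."

# Encode context and answer as a sentence pair; each answer token then receives
# a binary prediction from the classification head.
inputs = tokenizer(context, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1).squeeze(0)   # per-token class ids
```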
Performance Highlights
Example-Level Performance (Whole Dataset)
- F1-Score: 71.12% (strong performance, second only to specialized Turkish models)
- Precision: 78.66% (high precision)
- Recall: 64.90% (moderate recall)
- AUROC: 85.91% (excellent discriminative power)
Task-Specific Performance
Data2txt Task (Exceptional - Best in Class):
- F1-Score: 84.62% (highest among all evaluated models)
- Precision: 90.06% (highest precision)
- Recall: 79.79%
- AUROC: 88.50% (excellent)
QA Task:
- F1-Score: 62.82%
- Precision: 58.29%
- Recall: 68.13%
- AUROC: 86.30%
Summary Task:
- F1-Score: 29.08%
- Precision: 52.56%
- Recall: 20.10%
- AUROC: 69.77%
Token-Level Performance (Whole Dataset)
- F1-Score: 45.46%
- Precision: 55.50%
- Recall: 38.50%
- AUROC: 68.60%
Token-Level Task Performance:
- QA: F1 49.80%, AUROC 72.25%
- Data2txt: F1 51.15%, AUROC 71.81%
- Summary: F1 16.99%, AUROC 55.05%
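For reference, the sketch below illustrates how metrics at the two reporting levels above are typically computed: token-level scores are binary precision/recall/F1 over per-token labels (with AUROC over predicted probabilities), and an example-level label can be derived by flagging a response as hallucinated if any of its tokens is predicted hallucinated. This is a generic illustration, not the project's evaluation script.

```python
# Generic illustration of token-level vs. example-level metric computation;
# not the Turk-LettuceDetect evaluation script.
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Toy per-token labels for two answers: 1 = hallucinated token, 0 = supported.
token_true  = [[0, 0, 1, 1, 0], [0, 0, 0, 0]]
token_pred  = [[0, 1, 1, 0, 0], [0, 0, 0, 0]]
token_score = [[0.1, 0.6, 0.9, 0.4, 0.2], [0.1, 0.2, 0.1, 0.3]]

# Token-level metrics over all tokens pooled together.
flat_true, flat_pred, flat_score = (sum(x, []) for x in (token_true, token_pred, token_score))
p, r, f1, _ = precision_recall_fscore_support(flat_true, flat_pred, average="binary")
auroc = roc_auc_score(flat_true, flat_score)
print(f"token-level   P={p:.2%} R={r:.2%} F1={f1:.2%} AUROC={auroc:.2%}")

# Example-level label: an answer counts as hallucinated if any token is flagged.
ex_true = [int(any(t)) for t in token_true]
ex_pred = [int(any(t)) for t in token_pred]
p, r, f1, _ = precision_recall_fscore_support(ex_true, ex_pred, average="binary")
print(f"example-level P={p:.2%} R={r:.2%} F1={f1:.2%}")
```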
Key Advantages
- Exceptional Data2txt Performance: Best-in-class F1-score (84.62%) and precision (90.06%) for data-to-text tasks
- Strong Discriminative Power: 85.91% AUROC demonstrates excellent ability to distinguish hallucinations
- High Precision: 78.66% precision (whole dataset) suitable for production systems
- Multilingual Foundation: Benefits from multilingual pretraining while specialized for Turkish
- Balanced Performance: Good balance between precision and recall across tasks
Intended Use
This model is designed for:
- Turkish RAG Systems: Detecting hallucinations in generated Turkish text
- Data2txt Applications: Exceptional performance in data-to-text generation scenarios
- Production Deployment: High-precision hallucination detection suitable for real-time systems
- Multilingual Contexts: Applications that may benefit from multilingual understanding
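As a hedged sketch of the production-deployment pattern above, the snippet below runs the detector on every generated answer and withholds responses containing predicted hallucinated spans. The detector API follows the upstream LettuceDetect package; `generate_answer` is a hypothetical stand-in for your RAG generator, and the fallback policy is purely illustrative.

```python
# Guardrail sketch for a Turkish RAG pipeline. The HallucinationDetector API
# follows the upstream LettuceDetect package; generate_answer is a hypothetical
# stand-in for your own RAG generator.
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/bge-m3-HD",
)

def generate_answer(contexts: list[str], question: str) -> str:
    # Placeholder: call your LLM / RAG generator here.
    return "Üretilen cevap metni..."

def answer_with_guardrail(contexts: list[str], question: str) -> str:
    answer = generate_answer(contexts, question)
    spans = detector.predict(
        context=contexts, question=question, answer=answer, output_format="spans"
    )
    if spans:  # any predicted hallucinated span triggers a fallback
        return "Kaynaklardan doğrulanamayan içerik tespit edildi; yanıt gösterilmiyor."
    return answer
```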
Training Data
The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:
- Training Samples: 17,790 examples
- Test Samples: 2,700 examples
- Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
- Annotation: Token-level hallucination labels preserved during translation
Evaluation Data
The model was evaluated on the RAGTruth-TR test set across three task types:
- Summary: 900 examples
- Data2txt: 900 examples
- QA: 900 examples
- Whole Dataset: 2,700 examples
Limitations
- Token-Level Performance: The token-level F1-score (45.46% on the whole dataset) is lower than that of larger specialized models (71-78%)
- Summary Task: Lower performance in summarization tasks (29.08% F1) with low recall (20.10%)
- Model Size: Larger than specialized Turkish models, requiring more computational resources
- Language Focus: Although the base model is multilingual, this variant is optimized specifically for Turkish RAG scenarios
Recommendations
Use this model when:
- Data2txt tasks are the primary use case (84.62% F1, best-in-class)
- High precision is required (78.66% whole dataset, 90.06% Data2txt)
- Multilingual understanding may be beneficial
- Strong discriminative power is needed (85.91% AUROC)
- The deployment is a production system requiring reliable performance
Consider alternatives when:
- Maximum efficiency is critical (use ettin-encoder-32M-TR-HD)
- Summary tasks are the primary use case (consider larger specialized models)
- Maximum accuracy across all tasks is required (use ModernBERT: 78.21% F1)
How to Use
```python
from lettucedetect.models.inference import HallucinationDetector

# Load the model via the LettuceDetect inference API
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/bge-m3-HD",
)

# Detect hallucinations in a generated answer
contexts = ["Your source document text..."]
question = "Your question..."
answer = "Generated answer text..."

predictions = detector.predict(
    context=contexts,
    question=question,
    answer=answer,
    output_format="spans",
)

# Each prediction marks a hallucinated answer span with character offsets and a confidence score
print(predictions)
```
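With `output_format="spans"`, the upstream LettuceDetect API returns the character-level spans of the answer that are predicted to be hallucinated, each with a confidence score; these spans correspond to the token-level predictions reported in the performance sections above.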
Citation
If you use this model, please cite:
```bibtex
@misc{taş2025turklettucedetecthallucinationdetectionmodels,
  title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
  author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
  year={2025},
  eprint={2509.17671},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.17671},
}
```

```bibtex
@misc{chen2024bgem3embeddingmultilingualmultifunctionality,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2402.03216},
}
```
Model Card Contact
For questions or issues, please open an issue on the project repository.