bge-m3-HD

Model Description

bge-m3-HD is a multilingual embedding model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Built on BAAI/bge-m3, this variant performs especially well on Data2txt tasks, achieving the highest F1-score (84.62%) among all evaluated models on that task.

This model leverages multilingual pretraining capabilities to provide strong cross-lingual understanding while being fine-tuned specifically for Turkish hallucination detection. It is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications.

Model Details

  • Model Type: Multilingual embedding model for token classification
  • Base Model: BAAI/bge-m3
  • Parameters: ~0.6B (F32)
  • Language: Turkish (with multilingual capabilities)
  • Task: Hallucination Detection (Token-Level Binary Classification)
  • Framework: LettuceDetect
  • Fine-tuned on: RAGTruth-TR dataset

Performance Highlights

Example-Level Performance (Whole Dataset)

  • F1-Score: 71.12% (second only to the specialized Turkish models)
  • Precision: 78.66%
  • Recall: 64.90%
  • AUROC: 85.91% (strong discriminative power)
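As a sanity check, the example-level F1 above is the harmonic mean of the reported precision and recall:

```python
precision, recall = 0.7866, 0.6490

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7112
```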

Task-Specific Performance

Data2txt Task (Exceptional - Best in Class):

  • F1-Score: 84.62% (highest among all evaluated models)
  • Precision: 90.06% (highest precision)
  • Recall: 79.79%
  • AUROC: 88.50% (excellent)

QA Task:

  • F1-Score: 62.82%
  • Precision: 58.29%
  • Recall: 68.13%
  • AUROC: 86.30%

Summary Task:

  • F1-Score: 29.08%
  • Precision: 52.56%
  • Recall: 20.10%
  • AUROC: 69.77%

Token-Level Performance (Whole Dataset)

  • F1-Score: 45.46%
  • Precision: 55.50%
  • Recall: 38.50%
  • AUROC: 68.60%
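AUROC here measures how well the model's per-token scores rank hallucinated tokens above supported ones: it equals the probability that a randomly chosen hallucinated token receives a higher score than a randomly chosen supported one. A self-contained toy illustration (the labels and scores are made up, not taken from the evaluation):

```python
def auroc(labels, scores):
    """Probability that a positive's score beats a negative's (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two supported (0) and two hallucinated (1) tokens with made-up scores:
# 3 of the 4 positive/negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```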

Token-Level Task Performance:

  • QA: F1 49.80%, AUROC 72.25%
  • Data2txt: F1 51.15%, AUROC 71.81%
  • Summary: F1 16.99%, AUROC 55.05%

Key Advantages

  1. Exceptional Data2txt Performance: Best-in-class F1-score (84.62%) and precision (90.06%) for data-to-text tasks
  2. Strong Discriminative Power: 85.91% AUROC demonstrates excellent ability to distinguish hallucinated from supported content
  3. High Precision: 78.66% precision (whole dataset) suitable for production systems
  4. Multilingual Foundation: Benefits from multilingual pretraining while specialized for Turkish
  5. Balanced Performance: Good balance between precision and recall across tasks

Intended Use

This model is designed for:

  • Turkish RAG Systems: Detecting hallucinations in generated Turkish text
  • Data2txt Applications: Exceptional performance in data-to-text generation scenarios
  • Production Deployment: High-precision hallucination detection suitable for real-time systems
  • Multilingual Contexts: Applications that may benefit from multilingual understanding

Training Data

The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:

  • Training Samples: 17,790 examples
  • Test Samples: 2,700 examples
  • Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
  • Annotation: Token-level hallucination labels preserved during translation

Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

  • Summary: 900 examples
  • Data2txt: 900 examples
  • QA: 900 examples
  • Whole Dataset: 2,700 examples

Limitations

  1. Token-Level Performance: Token-level F1-scores (45.46%) are lower than those of larger specialized models (71-78%)
  2. Summary Task: Lower performance in summarization tasks (29.08% F1) with low recall (20.10%)
  3. Model Size: Larger than specialized Turkish models, requiring more computational resources
  4. Language Focus: Although the base model is multilingual, this fine-tune is optimized specifically for Turkish RAG scenarios

Recommendations

Use this model when:

  • Data2txt tasks are the primary use case (84.62% F1, best-in-class)
  • High precision is required (78.66% whole dataset, 90.06% Data2txt)
  • Multilingual understanding may be beneficial
  • Strong discriminative power is needed (85.91% AUROC)
  • Reliable performance is needed in a production system

Consider alternatives when:

  • Maximum efficiency is critical (use ettin-encoder-32M-TR-HD)
  • Summary tasks are the primary use case (consider larger specialized models)
  • Maximum accuracy across all tasks is required (use ModernBERT: 78.21% F1)

How to Use

from lettucedetect.models.inference import HallucinationDetector

# Load the model via the LettuceDetect transformer-based detector
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/bge-m3-HD",
)

# Detect hallucinations; the context is a list of source passages
context = ["Your source document text..."]
question = "Your question..."
answer = "Generated answer text..."

# Character-level spans predicted as hallucinated
spans = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print(spans)
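Internally, token-level detectors like this one produce a 0/1 label per token, which is then reduced to character spans by merging contiguous flagged tokens. A minimal, self-contained sketch of that post-processing (the function and variable names are illustrative, not part of the LettuceDetect API):

```python
def merge_token_spans(offsets, labels):
    """Merge contiguous tokens labeled 1 (hallucinated) into character spans.

    offsets: list of (start, end) character offsets, one per token
    labels:  list of 0/1 predictions, one per token
    """
    spans = []
    current = None
    for (start, end), label in zip(offsets, labels):
        if label == 1:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

# Tokens 3 and 4 are flagged, so they collapse into one span.
offsets = [(0, 3), (4, 9), (10, 14), (15, 21), (22, 27)]
labels = [0, 0, 1, 1, 0]
print(merge_token_spans(offsets, labels))  # [(10, 21)]
```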

Citation

If you use this model, please cite:

@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications}, 
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671}, 
}

@misc{chen2024bgem3embeddingmultilingualmultifunctionality,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.03216}, 
}

Model Card Contact

For questions or issues, please open an issue on the project repository.
