---
language:
- en
tags:
- fact-checking
- misinformation-detection
- bert
- modernbert
datasets:
- FELM
- FEVER
- HaluEval
- LIAR
metrics:
- accuracy
- f1
---
|
|
|
|
|
# ModernBERT Fact-Checking Model |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This is a ModernBERT model fine-tuned for binary fact-checking classification on a consolidated corpus drawn from four established benchmarks (FELM, FEVER, HaluEval, and LIAR). Given a claim, the model predicts whether it is likely true (label 1) or false (label 0).
|
|
|
|
|
**Base Model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) |
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
### Primary Use |
|
|
- Automated fact-checking systems |
|
|
- Misinformation detection pipelines |
|
|
- Content moderation tools |
|
|
|
|
|
### Out-of-Scope Uses |
|
|
- Non-English fact-checking (the model is trained on English data only)
|
|
- Medical/legal claim verification |
|
|
- Highly domain-specific claims |
|
|
|
|
|
### How to Use
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("your-username/modernbert-factchecking")
model = AutoModelForSequenceClassification.from_pretrained("your-username/modernbert-factchecking")

# Tokenize a single claim and run inference without tracking gradients.
inputs = tokenizer("Your claim to verify here", return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.softmax(outputs.logits, dim=-1)
```
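
Per the model description, label 0 means false and label 1 means true; a minimal follow-up for turning the probabilities into a verdict (the `label_names` mapping here is inferred from the description, not read from the model config):

```python
# Label mapping per the model description above: 0 = false, 1 = true.
label_names = {0: "false", 1: "true"}
pred_id = predictions.argmax(dim=-1).item()
print(f"Verdict: {label_names[pred_id]} (p={predictions[0, pred_id].item():.3f})")
```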
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on a combination of four datasets: |
|
|
|
|
|
| Dataset | Samples | Domain |
|---------|---------|--------|
| FELM | 34,000 | General claims |
| FEVER | 145,000 | Wikipedia-based claims |
| HaluEval | 12,000 | QA hallucination detection |
| LIAR | 12,800 | Political claims |
|
|
|
|
|
**Total training samples:** ~203,800 |
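
The card does not include the consolidation script; below is a minimal sketch of one way to merge sources into the shared schema with the `datasets` library. The dataset contents and column names are placeholders, not the exact sources used.

```python
from datasets import Dataset, concatenate_datasets

def to_schema(ds, text_col, label_col, source):
    # Project a source dataset onto the shared {text, label, source} schema.
    return ds.map(
        lambda ex: {"text": ex[text_col], "label": float(ex[label_col]), "source": source},
        remove_columns=ds.column_names,
    )

# Toy stand-ins for the real corpora; column names are illustrative only.
fever_like = Dataset.from_dict({"claim": ["The Eiffel Tower is in Paris."], "verdict": [1]})
liar_like = Dataset.from_dict({"statement": ["Example political claim."], "verdict": [0]})

combined = concatenate_datasets([
    to_schema(fever_like, "claim", "verdict", "FEVER"),
    to_schema(liar_like, "statement", "verdict", "LIAR"),
]).shuffle(seed=42)
```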
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hyperparameters |
|
|
- Learning Rate: 5e-5 |
|
|
- Batch Size: 32 |
|
|
- Epochs: 1 |
|
|
- Max Sequence Length: 512 tokens |
|
|
- Optimizer: `adamw_torch_fused`
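
A minimal sketch of how these values map onto Hugging Face `TrainingArguments`, assuming the standard `Trainer` API was used (the card does not state the exact training setup); the 512-token limit is applied at tokenization time rather than here:

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the hyperparameters above.
training_args = TrainingArguments(
    output_dir="modernbert-factchecking",  # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=1,
    optim="adamw_torch_fused",
)
```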
|
|
|
|
|
### Preprocessing |
|
|
All datasets were converted to a standardized format: |
|
|
```python
{
    "text": "full claim text",   # the claim to verify
    "label": 0.0,                # 0.0 = false, 1.0 = true
    "source": "dataset_name"     # originating dataset
}
```
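
As a concrete instance of this conversion, FEVER annotates claims as SUPPORTS, REFUTES, or NOT ENOUGH INFO; one plausible binarization is sketched below (how NOT ENOUGH INFO claims were handled in the actual training run is not stated in this card):

```python
def fever_to_record(example):
    # SUPPORTS -> 1.0 (true), REFUTES -> 0.0 (false); "NOT ENOUGH INFO"
    # claims are assumed to be filtered out before this step.
    mapping = {"SUPPORTS": 1.0, "REFUTES": 0.0}
    return {
        "text": example["claim"],
        "label": mapping[example["label"]],
        "source": "FEVER",
    }

record = fever_to_record({"claim": "Paris is the capital of France.", "label": "SUPPORTS"})
# {'text': 'Paris is the capital of France.', 'label': 1.0, 'source': 'FEVER'}
```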