---
language:
- en
tags:
- fact-checking
- misinformation-detection
- bert
- modernbert
datasets:
- FELM
- FEVER
- HaluEval
- LIAR
metrics:
- accuracy
- f1
---
|
|
|
|
|
# ModernBERT Fact-Checking Model |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This is a ModernBERT model fine-tuned for binary fact-checking classification on a consolidated corpus drawn from four established benchmarks (FELM, FEVER, HaluEval, and LIAR). Given a claim, the model predicts whether it is likely true (label 1) or false (label 0).
|
|
|
|
|
**Base Model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) |
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
### Primary Use |
|
|
- Automated fact-checking systems |
|
|
- Misinformation detection pipelines |
|
|
- Content moderation tools |
|
|
|
|
|
### Out-of-Scope Uses |
|
|
- Non-English fact-checking (the model is trained on English data only)
|
|
- Medical/legal claim verification |
|
|
- Highly domain-specific claims |
|
|
|
|
|
### How to Use
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("your-username/modernbert-factchecking")
model = AutoModelForSequenceClassification.from_pretrained("your-username/modernbert-factchecking")

# Tokenize a single claim and run inference without tracking gradients.
inputs = tokenizer("Your claim to verify here", return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.softmax(outputs.logits, dim=-1)
```
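
Per the model description, label 0 means false and label 1 means true; a minimal follow-up for turning the probabilities into a verdict (the `label_names` mapping here is inferred from the description, not read from the model config):

```python
# Label mapping per the model description above: 0 = false, 1 = true.
label_names = {0: "false", 1: "true"}
pred_id = predictions.argmax(dim=-1).item()
print(f"Verdict: {label_names[pred_id]} (p={predictions[0, pred_id].item():.3f})")
```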
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on a combination of four datasets: |
|
|
|
|
|
| Dataset | Samples | Domain |
|---------|---------|--------|
| FELM | 34,000 | General claims |
| FEVER | 145,000 | Wikipedia-based claims |
| HaluEval | 12,000 | QA hallucination detection |
| LIAR | 12,800 | Political claims |
|
|
|
|
|
**Total training samples:** ~203,800 |
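
The card does not include the consolidation script; below is a minimal sketch of one way to merge sources into the shared schema with the `datasets` library. The dataset contents and column names are placeholders, not the exact sources used.

```python
from datasets import Dataset, concatenate_datasets

def to_schema(ds, text_col, label_col, source):
    # Project a source dataset onto the shared {text, label, source} schema.
    return ds.map(
        lambda ex: {"text": ex[text_col], "label": float(ex[label_col]), "source": source},
        remove_columns=ds.column_names,
    )

# Toy stand-ins for the real corpora; column names are illustrative only.
fever_like = Dataset.from_dict({"claim": ["The Eiffel Tower is in Paris."], "verdict": [1]})
liar_like = Dataset.from_dict({"statement": ["Example political claim."], "verdict": [0]})

combined = concatenate_datasets([
    to_schema(fever_like, "claim", "verdict", "FEVER"),
    to_schema(liar_like, "statement", "verdict", "LIAR"),
]).shuffle(seed=42)
```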
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hyperparameters |
|
|
- Learning Rate: 5e-5 |
|
|
- Batch Size: 32 |
|
|
- Epochs: 1 |
|
|
- Max Sequence Length: 512 tokens |
|
|
- Optimizer: `adamw_torch_fused`
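
A minimal sketch of how these values map onto Hugging Face `TrainingArguments`, assuming the standard `Trainer` API was used (the card does not state the exact training setup); the 512-token limit is applied at tokenization time rather than here:

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the hyperparameters above.
training_args = TrainingArguments(
    output_dir="modernbert-factchecking",  # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=1,
    optim="adamw_torch_fused",
)
```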
|
|
|
|
|
### Preprocessing |
|
|
All datasets were converted to a standardized format: |
|
|
```python
{
    "text": "full claim text",   # the claim to verify
    "label": 0.0,                # 0.0 = false, 1.0 = true
    "source": "dataset_name"     # originating dataset
}
```
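
As a concrete instance of this conversion, FEVER annotates claims as SUPPORTS, REFUTES, or NOT ENOUGH INFO; one plausible binarization is sketched below (how NOT ENOUGH INFO claims were handled in the actual training run is not stated in this card):

```python
def fever_to_record(example):
    # SUPPORTS -> 1.0 (true), REFUTES -> 0.0 (false); "NOT ENOUGH INFO"
    # claims are assumed to be filtered out before this step.
    mapping = {"SUPPORTS": 1.0, "REFUTES": 0.0}
    return {
        "text": example["claim"],
        "label": mapping[example["label"]],
        "source": "FEVER",
    }

record = fever_to_record({"claim": "Paris is the capital of France.", "label": "SUPPORTS"})
# {'text': 'Paris is the capital of France.', 'label': 1.0, 'source': 'FEVER'}
```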