# BAD Classifier for FairSteer - TinyLlama-1.1B
This is a Biased Activation Detection (BAD) classifier trained for the FairSteer framework.
## Model Details
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Task: Binary classification (Biased vs Unbiased activations)
- Training Data: BBQ dataset with balanced sampling
- Best Layer: 14
- Validation Accuracy: 69.40%
- Architecture: Simple linear classifier (FairSteer-aligned)
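The card describes the classifier as a single linear probe over hidden-state activations. As a reference point, here is a minimal sketch of what that architecture could look like; the hidden size of 2048 (TinyLlama-1.1B) and the sigmoid head are assumptions, so check `config.json` and the saved parameter names against it.

```python
import torch
import torch.nn as nn

class BADClassifier(nn.Module):
    """Sketch of a FairSteer-style linear probe: one linear layer mapping a
    hidden-state vector to a single biased/unbiased score (assumed layout)."""

    def __init__(self, hidden_size: int = 2048):  # 2048 = TinyLlama-1.1B hidden size (assumed)
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # Probability that the input activation is UNBIASED.
        return torch.sigmoid(self.linear(activation)).squeeze(-1)
```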
## Usage
```python
import json
import torch

# Load the classifier weights and configuration shipped with this repo.
model = torch.load("pytorch_model.bin", map_location="cpu")
with open("config.json", "r") as f:
    config = json.load(f)

# Use for bias detection:
#   Input:  activation vector from layer 14 of the base LLM
#   Output: probability that the activation is unbiased
```
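A fuller end-to-end example is sketched below: it extracts the layer-14 hidden state of the last token from TinyLlama and scores it with the probe. The parameter key names (`linear.weight`, `linear.bias`), the use of the last token, and the sigmoid head are assumptions; inspect the loaded state dict and `config.json` to confirm them.

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(BASE)
lm = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)

# Classifier weights and config from this repository.
state = torch.load("pytorch_model.bin", map_location="cpu")
with open("config.json", "r") as f:
    cfg = json.load(f)

# Hidden state of the last token at layer 14 (hidden_states[0] is the
# embedding output, so index 14 is assumed to match the card's "best layer").
prompt = "The nurse said that"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = lm(**inputs)
activation = out.hidden_states[14][0, -1]  # shape: [hidden_size]

# Score with the linear probe; the key names below are assumed.
logit = activation @ state["linear.weight"][0] + state["linear.bias"][0]
p_unbiased = torch.sigmoid(logit)
print(f"P(unbiased) = {p_unbiased.item():.3f}")
```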
## Training Details
- Samples: 24,284 balanced samples
- Class Distribution: 50% BIASED, 50% UNBIASED (see the sampling sketch after this list)
- Training Method: FairSteer-aligned labeling
- Training Date: 2025-11-16
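The 50/50 class distribution suggests the majority class was downsampled before training. A generic sketch of that kind of balancing is shown below; the function, field names, and label strings are hypothetical, and the actual FairSteer sampling procedure may differ.

```python
import random

def balance_binary(examples, label_key="label", seed=0):
    """Downsample the majority class so BIASED and UNBIASED examples are 50/50.
    Generic illustration only; field names and labels are hypothetical."""
    rng = random.Random(seed)
    biased = [e for e in examples if e[label_key] == "BIASED"]
    unbiased = [e for e in examples if e[label_key] == "UNBIASED"]
    n = min(len(biased), len(unbiased))
    balanced = rng.sample(biased, n) + rng.sample(unbiased, n)
    rng.shuffle(balanced)
    return balanced
```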
## Citation
If you use this model, please cite the FairSteer paper:
```bibtex
@article{fairsteer,
  title={FairSteer: Inference-Time Debiasing for Large Language Models},
  author={[Authors]},
  journal={[Journal]},
  year={2024}
}
```
## License
Apache 2.0