BAD Classifier for FairSteer - TinyLlama-1.1B

This is a Biased Activation Detection (BAD) classifier trained for the FairSteer inference-time debiasing framework. Given a hidden-state activation vector from the base LLM, it predicts whether that activation is biased or unbiased.

Model Details

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Task: Binary classification (Biased vs Unbiased activations)
  • Training Data: BBQ dataset with balanced sampling
  • Best Layer: 14
  • Validation Accuracy: 69.40%
  • Architecture: Simple linear classifier (FairSteer-aligned); a sketch of what this could look like follows this list

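The released files do not spell out the probe's architecture beyond "simple linear classifier". The snippet below is a minimal sketch of what such a probe could look like; the class name BADClassifier, the two-logit output head, and the 2048-dimensional input (TinyLlama-1.1B's hidden size) are illustrative assumptions, not a description of the shipped weights.

import torch
import torch.nn as nn

class BADClassifier(nn.Module):
    """Hypothetical single-layer probe over a hidden-state activation."""

    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        # One linear map from the activation to [biased, unbiased] logits.
        self.linear = nn.Linear(hidden_size, 2)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        return self.linear(activation)
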
Usage

import torch
import json

# Load the trained classifier weights. The file may hold a pickled module or a
# state dict; on PyTorch 2.6+ you may need torch.load(..., weights_only=False).
model = torch.load("pytorch_model.bin", map_location="cpu")

# Load the accompanying configuration
with open("config.json", "r") as f:
    config = json.load(f)

# Use for bias detection:
#   Input:  activation vector from layer 14 of the base LLM
#   Output: probability of the activation being unbiased
# See the end-to-end sketch below for one way to wire this up.

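For context, here is one way the pieces could fit together end to end: pull the layer-14 hidden state for a prompt from the base model and score it with the classifier loaded above. This is a sketch under assumptions; in particular, using the last-token activation and treating logit index 1 as "unbiased" are guesses, not documented behaviour of these weights.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base_id)
lm = AutoModelForCausalLM.from_pretrained(base_id, output_hidden_states=True)
lm.eval()

prompt = "Example prompt to score for bias."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = lm(**inputs)

# hidden_states[0] is the embedding output, so index 14 is the output of layer 14.
# Using the last token's activation is an assumption about how the probe was trained.
activation = out.hidden_states[14][:, -1, :]

with torch.no_grad():
    logits = model(activation)                        # `model` loaded as in the snippet above
    p_unbiased = torch.softmax(logits, dim=-1)[:, 1]  # assumes index 1 = UNBIASED

print(f"P(unbiased) = {p_unbiased.item():.3f}")
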
Training Details

  • Samples: 24,284 (balanced)
  • Class Distribution: 50% BIASED / 50% UNBIASED; one way to build such a split is sketched after this list
  • Training Method: FairSteer-aligned labeling
  • Training Date: 2025-11-16

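The card states a 50/50 class split but not how it was produced. As one concrete, hypothetical illustration, the helper below balances a labeled example list by downsampling the larger class; the actual FairSteer-aligned sampling procedure may differ.

import random

def balance_by_downsampling(examples, label_key="label", seed=0):
    """Return a 50/50 BIASED/UNBIASED subset by downsampling the larger class.

    Illustrative only; not necessarily the procedure used for this model.
    """
    rng = random.Random(seed)
    biased = [e for e in examples if e[label_key] == "BIASED"]
    unbiased = [e for e in examples if e[label_key] == "UNBIASED"]
    n = min(len(biased), len(unbiased))
    balanced = rng.sample(biased, n) + rng.sample(unbiased, n)
    rng.shuffle(balanced)
    return balanced
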
Citation

If you use this model, please cite the FairSteer paper:

@article{fairsteer,
  title={FairSteer: Inference-Time Debiasing for Large Language Models},
  author={[Authors]},
  journal={[Journal]},
  year={2024}
}

License

Apache 2.0
