# ronantakizawa/sarashina2-7b-abliterated
This is an abliterated (refusal-removed) version of sbintuitions/sarashina2-7b.
## What is Abliteration?

Abliteration is a technique that removes the "refusal direction" from a language model's weights, making it more likely to comply with requests it would normally refuse. This is done through weight orthogonalization, following the research post *Refusal in LLMs is mediated by a single direction* (Arditi et al., 2024).
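
Concretely (the notation below is mine, not taken from this card): writing r̂ for the unit-norm refusal direction found in the residual stream, every weight matrix W that writes into the residual stream is replaced by its component orthogonal to r̂:

$$
W' = W - \hat{r}\,\hat{r}^{\top} W = \left(I - \hat{r}\,\hat{r}^{\top}\right) W
$$

The model can then no longer write activations along the refusal direction, while every component orthogonal to it is preserved.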
## Model Details

- Base Model: sbintuitions/sarashina2-7b
- Method: Weight orthogonalization
- Refusal Direction Layer: 25 of 32 (78.1% of model depth)
- Separation Score: 40.6445
- Calibration Samples: 128 harmful + 128 harmless prompts
 
## Abliteration Results

### Best Candidate Selection

The refusal direction was computed by testing six candidate layers and ranking them by the magnitude of their separation score:
| Rank | Layer | Separation Score | Harmful Proj | Harmless Proj | 
|---|---|---|---|---|
| 1 | 25 | 40.6445 | 47.6250 | 6.9805 | 
| 2 | 12 | -6.7148 | 3.3555 | 10.0703 | 
| 3 | 22 | -4.6953 | 12.6016 | 17.2969 | 
| 4 | 9 | -3.3867 | 2.7461 | 6.1328 | 
| 5 | 16 | 2.6875 | 8.5391 | 5.8516 | 
| 6 | 19 | -0.1641 | 9.6484 | 9.8125 | 
Selected: Layer 25, with a separation score of 40.6445.

A high positive separation score indicates a strong distinction between harmful and harmless activations, making this layer an ideal candidate for abliteration.
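
The scoring code itself is not included in this card; the sketch below (function and variable names are mine) shows how such projections and the separation score can be derived from per-layer activations:

```python
import torch

def refusal_direction_and_score(harmful_acts: torch.Tensor,
                                harmless_acts: torch.Tensor):
    """harmful_acts / harmless_acts: (num_prompts, d_model) tensors of
    last-token residual-stream activations collected at one candidate layer."""
    # Candidate refusal direction: difference of the two activation means.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    r_hat = direction / direction.norm()

    # Mean projection of each prompt set onto the unit direction.
    harmful_proj = (harmful_acts @ r_hat).mean().item()
    harmless_proj = (harmless_acts @ r_hat).mean().item()

    # Separation score as reported in the table: harmful minus harmless projection.
    return r_hat, harmful_proj, harmless_proj, harmful_proj - harmless_proj
```

For the selected layer this matches the reported numbers: 47.6250 - 6.9805 = 40.6445.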
### Performance Metrics
- Harmful Projection: 47.6250
 - Harmless Projection: 6.9805
 - Separation: 40.6445
 
### Baseline Evaluation
The baseline model (before abliteration) showed:
- Refusal Rate: 0/4 (0.0%) on the harmful test prompts
 - The base model already had minimal refusal behavior
 - Abliteration further reduces any remaining safety guardrails
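
The evaluation prompts and refusal detector are not published with this card; below is a minimal sketch of how such a refusal rate can be measured (the marker phrases are illustrative assumptions, not the detector actually used for the 0/4 result above):

```python
# Hypothetical refusal detector; the phrase list is an assumption.
REFUSAL_MARKERS = [
    "I cannot", "I can't", "I'm sorry",                 # English refusal phrases
    "できません", "申し訳ありません", "お答えできません",  # Japanese refusal phrases
]

def is_refusal(completion: str) -> bool:
    return any(marker in completion for marker in REFUSAL_MARKERS)

def refusal_rate(completions: list[str]) -> float:
    return sum(is_refusal(c) for c in completions) / len(completions)
```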
 
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ronantakizawa/sarashina2-7b-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Build the prompt with the repository's chat template.
messages = [
    {"role": "user", "content": "こんにちは"}  # "Hello" in Japanese
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Template

This model uses a simple instruction-response format:

```
### Instruction:
[user message]

### Response:
[assistant response]
```
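
If the tokenizer does not ship a chat template, the same format can be built by hand; this is only a sketch, and the exact whitespace used during abliteration is my assumption:

```python
def build_prompt(user_message: str) -> str:
    # Mirrors the instruction-response format above; exact newlines are an assumption.
    return f"### Instruction:\n{user_message}\n\n### Response:\n"

prompt = build_prompt("こんにちは")  # "Hello" in Japanese
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=512)
```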
## Ethical Considerations
⚠️ Warning: This model has had its safety features removed and may generate harmful, unethical, or illegal content.
Intended Use:
- Research on AI safety and alignment
 - Understanding refusal mechanisms in LLMs
 - Red-teaming and adversarial testing
 - Educational purposes
 
Not Intended For:
- Production deployments without additional safety measures
 - Generating harmful content for malicious purposes
 - Bypassing content policies
 
## Technical Details

### Abliteration Method
1. Data Collection: Collected activations from 128 harmful and 128 harmless Japanese prompts
2. Direction Computation: Computed the mean difference between harmful and harmless activations at six candidate layers, located at 30%, 40%, 50%, 60%, 70%, and 80% of model depth
3. Candidate Ranking: Ranked the candidate layers by separation score (harmful_projection - harmless_projection); see the sanity-check sketch after this list
4. Weight Orthogonalization: Applied an orthogonal projection to the embedding and transformer layer weights to remove the refusal direction
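
As a quick sanity check on the direction-computation and ranking steps (the 32-layer count is my assumption, taken from the base model's architecture), the depth fractions map onto exactly the layer indices shown in the ranking table, and selecting by absolute separation score picks layer 25:

```python
# Depth fractions mapped to concrete decoder-layer indices (assumes 32 layers).
fractions = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
candidate_layers = [int(f * 32) for f in fractions]
print(candidate_layers)  # [9, 12, 16, 19, 22, 25]

# Ranking step: scores copied from the table above; rank by absolute value.
scores = {25: 40.6445, 12: -6.7148, 22: -4.6953, 9: -3.3867, 16: 2.6875, 19: -0.1641}
best_layer = max(scores, key=lambda layer: abs(scores[layer]))
print(best_layer)  # 25
```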
 
### Architecture Changes

Modified weights:
- Embedding layer (`model.embed_tokens.weight`)
- Attention output projections (`layer.self_attn.o_proj.weight`)
- MLP output projections (`layer.mlp.down_proj.weight`)

The original architecture and all other weights remain unchanged.
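
Below is a minimal sketch of how the orthogonalization can be applied to exactly those weights with PyTorch, assuming the Llama-style module paths of the base model; the function names and the `abliterate` helper are mine, not the pipeline's actual code. Hugging Face `Linear` weights are stored as `(out_features, in_features)`, so the output dimension of `o_proj`/`down_proj` is the residual-stream dimension.

```python
import torch

@torch.no_grad()
def orthogonalize_rows(mat: torch.Tensor, r_hat: torch.Tensor) -> None:
    """Remove the refusal component from each row of `mat` (rows live in
    residual-stream space, e.g. the embedding matrix), in place."""
    mat -= torch.outer(mat @ r_hat, r_hat)

@torch.no_grad()
def orthogonalize_outputs(weight: torch.Tensor, r_hat: torch.Tensor) -> None:
    """Remove the refusal component from the output (first) dimension of a
    Linear weight stored as (out_features, in_features), in place."""
    weight -= torch.outer(r_hat, r_hat @ weight)

def abliterate(model, r_hat: torch.Tensor) -> None:
    # Embedding rows write token vectors directly into the residual stream.
    orthogonalize_rows(model.model.embed_tokens.weight.data, r_hat)
    for layer in model.model.layers:
        # Attention and MLP output projections also write into the residual stream.
        orthogonalize_outputs(layer.self_attn.o_proj.weight.data, r_hat)
        orthogonalize_outputs(layer.mlp.down_proj.weight.data, r_hat)
```

Here `r_hat` is the unit refusal direction from layer 25, cast to the weights' dtype and device before use.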
## Limitations
- Safety fine-tuning has been removed
 - May generate biased, harmful, or incorrect content
 - No guarantees on output quality or safety
 - Japanese-language model, primarily trained on Japanese text
 
## Citation
If you use this model, please cite the original abliteration research:
```bibtex
@article{arditi2024refusal,
  title={Refusal in LLMs is mediated by a single direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Gurnee, Wes and Nanda, Neel},
  journal={LessWrong},
  year={2024}
}
```
## Model Card Authors

This model card was generated by an automated abliteration pipeline.
## Acknowledgments

- Base model: sbintuitions/sarashina2-7b by SB Intuitions
- Abliteration technique: FailSpy and the original researchers
- Implementation inspired by Maxime Labonne's work
 