ronantakizawa/sarashina2-7b-abliterated

This is an abliterated (refusal-removed) version of sbintuitions/sarashina2-7b.

What is Abliteration?

Abliteration is a technique that removes the "refusal direction" from a language model's weights, making it more likely to comply with requests it would normally refuse. It is implemented here through weight orthogonalization, following the research post "Refusal in LLMs is mediated by a single direction".
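Concretely, a unit refusal direction $\hat{r}$ is estimated in the model's residual stream, and every weight matrix $W$ that writes into that stream is edited so it can no longer write along $\hat{r}$. A sketch of the update described in that work:

$$ W' = W - \hat{r}\,\hat{r}^{\top} W $$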

Model Details

  • Base Model: sbintuitions/sarashina2-7b
  • Method: Weight Orthogonalization
  • Refusal Direction Layer: 25 of 32 (78.1% of model depth)
  • Separation Score: 40.6445
  • Calibration Samples: 128 harmful + 128 harmless prompts

Abliteration Results

Best Candidate Selection

The refusal direction was computed by testing 6 different layers and ranking them by separation score:

| Rank | Layer | Separation Score | Harmful Proj | Harmless Proj |
|------|-------|------------------|--------------|---------------|
| 1    | 25    | 40.6445          | 47.6250      | 6.9805        |
| 2    | 12    | -6.7148          | 3.3555       | 10.0703       |
| 3    | 22    | -4.6953          | 12.6016      | 17.2969       |
| 4    | 9     | -3.3867          | 2.7461       | 6.1328        |
| 5    | 16    | 2.6875           | 8.5391       | 5.8516        |
| 6    | 19    | -0.1641          | 9.6484       | 9.8125        |

Selected: Layer 25, with a separation score of 40.6445

A high positive separation score indicates a strong distinction between harmful and harmless activations along the candidate direction, which makes layer 25 a strong candidate for abliteration.
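In code, the score is just the gap between the mean projections of the two prompt sets onto the candidate direction. A minimal sketch (tensor names are illustrative, not the pipeline's actual code):

import torch

# Separation score for one candidate layer: mean projection of harmful
# activations onto the unit candidate direction r_hat, minus the mean
# projection of harmless activations.
def separation_score(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor,
                     r_hat: torch.Tensor) -> torch.Tensor:
    harmful_proj = (harmful_acts @ r_hat).mean()
    harmless_proj = (harmless_acts @ r_hat).mean()
    return harmful_proj - harmless_proj  # layer 25: 47.6250 - 6.9805 = 40.6445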

Performance Metrics

  • Harmful Projection: 47.6250
  • Harmless Projection: 6.9805
  • Separation: 40.6445

Baseline Evaluation

The baseline model (before abliteration) showed:

  • Refusal Rate: 0/4 (0.0%) on a small set of harmful test prompts
  • The base model already exhibited minimal refusal behavior
  • Abliteration further reduces any remaining safety guardrails

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the abliterated model and its tokenizer.
model_id = "ronantakizawa/sarashina2-7b-abliterated"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

# Example prompt ("Hello" in Japanese).
messages = [
    {"role": "user", "content": "こんにちは"}
]

# Format the prompt with the model's instruction-response chat template
# (see the Chat Template section below) and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Template

This model uses a simple instruction-response format:

### Instruction:
[user message]

### Response:
[assistant response]
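If apply_chat_template is not available for the tokenizer in your environment, an equivalent prompt can be built by hand. A minimal sketch, reusing tokenizer and model from the Usage section (the exact whitespace is assumed from the template above):

# Build the instruction-response prompt manually instead of calling
# tokenizer.apply_chat_template.
prompt = "### Instruction:\nこんにちは\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))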

Ethical Considerations

⚠️ Warning: This model has had its safety features removed and may generate harmful, unethical, or illegal content.

Intended Use:

  • Research on AI safety and alignment
  • Understanding refusal mechanisms in LLMs
  • Red-teaming and adversarial testing
  • Educational purposes

Not Intended For:

  • Production deployments without additional safety measures
  • Generating harmful content for malicious purposes
  • Bypassing content policies

Technical Details

Abliteration Method

  1. Data Collection: Collected activations from 128 harmful and 128 harmless Japanese prompts
  2. Direction Computation: Calculated the mean difference between harmful and harmless activations at 6 candidate layers (30%, 40%, 50%, 60%, 70%, and 80% of model depth); see the sketch after this list
  3. Candidate Ranking: Ranked the candidate layers by separation score (harmful_projection - harmless_projection)
  4. Weight Orthogonalization: Applied an orthogonal projection to the embedding and transformer layer weights to remove the refusal direction
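A minimal sketch of step 2, assuming (as in the referenced research) that the direction is the normalized difference of mean activations at a candidate layer; tensor names are illustrative:

import torch

# Refusal direction at one candidate layer: normalized difference between
# the mean activation over harmful prompts and over harmless prompts.
# harmful_acts, harmless_acts: [num_prompts, hidden_size]
def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()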

Architecture Changes

Modified weights:

  • Embedding layer (model.embed_tokens.weight)
  • Attention output projections (layer.self_attn.o_proj.weight)
  • MLP output projections (layer.mlp.down_proj.weight)

Original architecture and all other weights remain unchanged.
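The orthogonalization itself can be sketched as below. This assumes a LLaMA-style module layout (model.model.embed_tokens, model.model.layers[i].self_attn.o_proj and .mlp.down_proj) and a precomputed unit refusal direction r_hat of shape [hidden_size] on the same device and dtype as the weights; it is illustrative, not the exact pipeline code:

import torch

with torch.no_grad():
    # Token embeddings live in the residual-stream basis: strip the r_hat
    # component from every embedding row.
    E = model.model.embed_tokens.weight            # [vocab_size, hidden_size]
    E -= torch.outer(E @ r_hat, r_hat)

    for layer in model.model.layers:
        # o_proj and down_proj write into the residual stream along their
        # rows (Hugging Face Linear weights are [out_features, in_features]):
        # W' = W - r_hat (r_hat^T W)
        for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
            W -= torch.outer(r_hat, r_hat @ W)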

Limitations

  • Safety fine-tuning has been removed
  • May generate biased, harmful, or incorrect content
  • No guarantees on output quality or safety
  • Japanese language model - primarily trained on Japanese text

Citation

If you use this model, please cite the original abliteration research:

@article{arditi2024refusal,
  title={Refusal in LLMs is mediated by a single direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Gurnee, Wes and Nanda, Neel},
  journal={LessWrong},
  year={2024}
}

Model Card Authors

Created using an automated abliteration pipeline.
