Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
Abstract
Four abliteration tools are evaluated for their effectiveness at removing refusal representations from large language models; capability preservation and distribution shift vary substantially depending on the tool and the model.
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
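For readers who want a concrete picture of the core operation, below is a minimal sketch of directional orthogonalization as it is typically described for abliteration: a refusal direction is estimated from the difference of mean activations on harmful versus harmless prompts, then projected out of weight matrices that write into the residual stream. The helper names, the single-direction assumption, and the KL helper are illustrative assumptions for this sketch only; it does not reproduce the exact procedure of any of the four benchmarked tools.

```python
import torch
import torch.nn.functional as F


def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction at one layer, unit-normalized.

    harmful_acts / harmless_acts: (n_prompts, d_model) residual-stream activations
    collected at the same layer and token position (hypothetical inputs).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


@torch.no_grad()
def orthogonalize_(weight: torch.Tensor, direction: torch.Tensor) -> None:
    """Remove the refusal direction from a weight matrix, in place.

    For a matrix W of shape (d_model, d_in) whose output is added to the
    residual stream (e.g. attention or MLP output projections),
    W <- W - r r^T W zeroes the component of W's output along the unit vector r.
    """
    r = direction.to(weight.dtype)
    weight.sub_(torch.outer(r, r) @ weight)


@torch.no_grad()
def mean_token_kl(base_logits: torch.Tensor, ablated_logits: torch.Tensor) -> float:
    """Mean per-token KL(base || ablated) over a shared prompt set.

    Logits are (n_tokens, vocab). This is one plausible way a distribution-shift
    number like the 0.043-1.646 range above could be computed; the paper's exact
    protocol may differ.
    """
    p_log = F.log_softmax(base_logits, dim=-1)
    q_log = F.log_softmax(ablated_logits, dim=-1)
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean").item()
```

In practice the direction is usually estimated at a mid-to-late layer and the projection applied to every matrix that writes into the residual stream; the tools compared here differ mainly in how they select layers and directions and whether they apply a single pass or search over the intervention (the abstract's Bayesian-optimized variant), which is what the benchmarks below probe.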
Community
TL;DR We benchmark 4 open-source LLM abliteration implementations across 16 instruction-tuned models.
Key results:
• Coverage varies widely (Heretic 16/16 models; DECCP 11/16; ErisForge 9/16; FailSpy 5/16).
• Single-pass methods preserved capabilities best on the benchmarked subset (avg GSM8K ∆: DECCP −0.13 pp, ErisForge −0.28 pp; Heretic −7.81 pp on average, driven largely by Yi).
• Math reasoning is the most sensitive axis (GSM8K swings from +1.51 pp to −18.81 pp depending on tool/model). 