# Gabliterated Model Series

## Overview
With this model series, I introduce Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration through adaptive multi-directional projections with regularized layer selection. Gabliteration addresses a fundamental limitation of existing abliteration methods: in attempting to modify specific behavioral patterns, they compromise overall model quality.
Results:

- Refusal: 6/200
- KL Div: 0.0127

Config:

- Samples: 400
- Skip: [1, 1]
- Layer: 0.77
- Scale: 0.40
- λ: 0.20
- k: 1
- β: 0.64
- Adaptive: False
## Model Variants
This series includes models ranging from 0.6B to 32B parameters, demonstrating the scalability and effectiveness of the Gabliteration technique across different model sizes.
## Quants

## Technical Background
Building upon the foundational work of Arditi et al. (2024) on single-direction abliteration, Gabliteration extends it to a comprehensive multi-directional framework with theoretical guarantees. My method employs singular value decomposition on difference matrices between harmful and harmless prompt representations to extract multiple refusal directions.
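The extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions only: paired harmful/harmless hidden states collected at a single layer, and a plain scaled projection of each direction out of a weight matrix. The function names, shapes, and the use of `Scale` as a projection weight are hypothetical, not the released implementation.

```python
import numpy as np

def refusal_directions(h_harmful: np.ndarray,
                       h_harmless: np.ndarray,
                       k: int = 1) -> np.ndarray:
    """SVD on the difference matrix between harmful and harmless
    representations; the top-k right singular vectors serve as
    refusal directions in model space.

    h_harmful, h_harmless: (n_prompts, d_model) hidden states
    returns: (k, d_model), rows are unit-norm, mutually orthogonal
    """
    diff = h_harmful - h_harmless
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k]

def gabliterate_weight(W: np.ndarray,
                       directions: np.ndarray,
                       scale: float = 0.40) -> np.ndarray:
    """Partially project the refusal directions out of a weight
    matrix W of shape (d_out, d_model). scale=1.0 removes each
    direction completely; smaller values attenuate it."""
    for v in directions:
        v = v / np.linalg.norm(v)
        W = W - scale * np.outer(W @ v, v)
    return W
```

With `scale=1.0` and `k=1` this reduces to full orthogonalization against a single direction, as in standard abliteration; the configuration above corresponds to a gentler `k=1`, `scale=0.40` setting.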
## Citation
If you use these models, please cite the original research (paper coming later this year):
Gülmez, G. (2025). Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models. https://arxiv.org/abs/2512.18901
## Acknowledgments
This work builds upon the foundational research by Arditi et al. (2024) on refusal direction identification in large language models.
## Evaluation Results

- KL Divergence on Harmless Alpaca (self-reported): 0.013
- Refusal Rate on Harmful Behaviors (self-reported): 0.030