Philipp Emanuel Weidmann
There is another possible interpretation of r∥μA, beyond the three you already gave in the article: namely, that even representations that lead to compliant responses already contain the seed of refusal, as part of a general "carefulness" induced by strong alignment. You can see, for example, in some reasoning models that the reasoning trace debates whether to refuse even for harmless prompts.
This would mean that r∥μA is in fact a genuine component of the refusal direction rather than a confounding factor. In that case, we should expect orthogonalizing only with respect to r⊥μA to yield slightly weaker refusal suppression than orthogonalizing with respect to the full r. That matches the observations from my few experiments with the technique so far, though of course you will have performed much more in-depth testing than I have.
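To make the comparison concrete, here is a minimal sketch of how I decompose and ablate the direction in my own experiments. It assumes μA denotes the mean activation on harmful prompts (my reading of the article's notation) and r the usual difference-of-means refusal direction; the helper names are mine, not from the article.

```python
import torch

def decompose(r: torch.Tensor, mu_A: torch.Tensor):
    """Split the refusal direction r into its components parallel and
    perpendicular to mu_A (assumed here to be the mean activation on
    harmful prompts)."""
    mu_hat = mu_A / mu_A.norm()
    r_parallel = (r @ mu_hat) * mu_hat  # projection of r onto mu_A
    r_perp = r - r_parallel             # remainder, orthogonal to mu_A
    return r_parallel, r_perp

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove `direction` from a weight matrix's output space:
    W <- (I - d d^T) W, with the rows of W as output dimensions."""
    d = direction / direction.norm()
    return weight - torch.outer(d, d) @ weight
```

Comparing refusal suppression after orthogonalizing against the full r versus only the perpendicular component is how I arrived at the observation above.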
Hi there, I'm working on a generic abliteration engine and have read this article with great interest! The implementation details are particularly enlightening, and Winsorizing the activations is a brilliant idea that I intend to copy.
I tried your "Projected Abliteration" proposal in my own engine today. The engine uses TPE optimization to find abliteration parameters, guided by a compound score that combines refusals on harmful prompts with KL divergence on harmless prompts. Replacing the standard refusal direction computation with a form essentially equivalent to what you describe caused the optimization to converge to slightly worse scores, though of course that doesn't mean the idea is incorrect in general.
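For context, the optimization loop looks roughly like the sketch below, written here with Optuna's TPE sampler. The helper functions, parameter names, and ranges are illustrative placeholders for my engine's internals, not its actual API.

```python
import optuna

HARMFUL_PROMPTS = []    # evaluation prompts would go here
HARMLESS_PROMPTS = []
KL_WEIGHT = 1.0         # relative weight of the divergence penalty

# Placeholder stubs standing in for the engine's internals; the real
# functions modify and evaluate an actual model.
def apply_abliteration(params):
    """Orthogonalize the model's weights according to `params`."""
    return None

def refusal_rate(model, prompts):
    """Fraction of harmful prompts that still produce a refusal."""
    return 0.0

def kl_divergence(model, prompts):
    """KL divergence from the unmodified model's outputs on harmless prompts."""
    return 0.0

def objective(trial: optuna.Trial) -> float:
    # Illustrative parameters only; the real engine exposes its own set.
    params = {
        "weight": trial.suggest_float("weight", 0.5, 1.5),
        "first_layer": trial.suggest_int("first_layer", 0, 15),
        "last_layer": trial.suggest_int("last_layer", 16, 31),
    }
    model = apply_abliteration(params)
    # Compound score: fewer refusals and less divergence are both better.
    return refusal_rate(model, HARMFUL_PROMPTS) + KL_WEIGHT * kl_divergence(model, HARMLESS_PROMPTS)

study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)
```

The KL term is there so the optimizer cannot "win" by simply damaging the model: it can only reduce refusals as far as faithfulness on harmless prompts allows.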
The theoretical justification, though, is somewhat less convincing. In particular, you claim:
> [refusal] is instead composed of a direction which pushes toward refusal and another direction which pushes away from compliance.
It's hard to see why the model would develop such a distinction internally, considering that it has never seen that distinction during training. Indeed, during instruction training, the model is shown exactly two classes of prompts:
- Harmless prompts, with compliant responses
- Harmful prompts, with refusal (and thus non-compliant) responses
Therefore, from the perspective of the model, refusal and non-compliance are the same thing. It has never seen a non-compliant response that wasn't also a refusal.
Then there is this part, which I am perhaps misunderstanding:
> However, for the purposes of abliteration, where the refusal direction is to be ablated to allow a model to comply with a harmful prompt, removing a push away from compliance is ungrounded as compliance is the goal.
Isn't the exact opposite the case? If we remove a push away from compliance, we should expect to be moving towards compliance, no?
Like you, I have found that Gemma 3 (and Qwen 3, Phi 4, and GPT-OSS) are substantially more resistant to abliteration than earlier models. My own theory has been that these models are far "denser" than previous generations, so crude interventions like orthogonalization are more likely to cause damage; for a given amount of model damage, the refusal-suppression effect is therefore less pronounced. This is supported by the observation that the same models are also more sensitive to low-bit quantization than their predecessors.