arxiv:2606.18322

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Published on Jun 16 · Submitted by Xingyi Yang on Jun 18

Abstract

Feature-level interventions on Sparse Autoencoder (SAE) features may appear successful, yet they can be circumvented by residual-space optimization that recovers the original behavior, which reveals the limits of using SAE features for complete behavioral control.

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
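
The encoder-orthogonal update mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a toy ReLU SAE encoder of the form f = ReLU(W_enc (x - b_dec) + b_enc), random weights, and a random vector standing in for a gradient step; the names `sae_features`, `project_out_encoder_rows`, and `defended` are hypothetical. The idea shown is only the projection step: removing from a residual update every component in the span of the defended features' encoder rows, so that those features' pre-activations do not move.

```python
import numpy as np


def sae_features(x, W_enc, b_enc, b_dec):
    """Toy ReLU SAE encoder (an assumed form, not taken from the paper)."""
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)


def project_out_encoder_rows(update, W_enc_defended):
    """Return `update` with its component in the row space of
    `W_enc_defended` removed.

    update:          (d,) residual-stream perturbation
    W_enc_defended:  (k, d) encoder rows of the defended features
    """
    # Reduced QR of the transpose gives orthonormal columns whose span
    # contains the span of the defended encoder rows.
    Q, _ = np.linalg.qr(W_enc_defended.T)
    return update - Q @ (Q.T @ update)


# Hypothetical toy setup: random SAE weights and a random residual state.
rng = np.random.default_rng(0)
d, n_features = 16, 64
W_enc = rng.normal(size=(n_features, d))
b_enc = rng.normal(size=n_features)
b_dec = rng.normal(size=d)
defended = [3, 17, 42]  # indices of the "defended" features (illustrative)

x = rng.normal(size=d)           # stand-in for the post-intervention residual
raw_update = rng.normal(size=d)  # stand-in for one optimization step
safe_update = project_out_encoder_rows(raw_update, W_enc[defended])

f_before = sae_features(x, W_enc, b_enc, b_dec)[defended]
f_after = sae_features(x + safe_update, W_enc, b_enc, b_dec)[defended]

# The defended pre-activations are unchanged up to floating-point error,
# while the update itself is still non-zero.
print(np.abs(W_enc[defended] @ safe_update).max())
print(np.linalg.norm(safe_update))
```

In this toy setting the projected update leaves the defended feature values where they were, while still moving the residual state in the remaining d - k directions. The abstract's claim is that such remaining directions can be enough to recover the suppressed behavior; the sketch does not attempt to show that part.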

Community

Paper submitter

SAE interventions are not as reliable as they look! 🧠🔒

We show that clamping unsafe SAE features does not reliably remove bad behaviors. Even with interventions active, suppressed behaviors can still recover through alternative residual-space directions. 🧩↩️

Feature-level control ≠ behavioral safety. 🚨

Arxiv: https://arxiv.org/abs/2606.18322
Code: https://github.com/Mingyuee88/sae-post-intervention-recovery
Project Page: https://mingyuee88.github.io/sae-post-intervention-recovery/

This is a sobering look at SAE-based interventions. It's a bit worrying that we can clamp specific features and still have the model recover the harmful behavior so easily, especially with that 95.8% success rate in the refusal-steering tests.

Do you think the fact that recovery is localized to the SAE reconstruction residual implies that our current dictionary learning methods are fundamentally insufficient for safety, or is it more of a training oversight?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/06a1ed2f-5b76-41ca-a51c-e74a2783e667

the big takeaway here is that clamping a targeted sae feature can suppress a behavior yet still leave a recoverable pathway through the reconstruction residual. that shift to the residual channel as the real carrier reminds me of how residual bottlenecks show up in safety evaluations, and the arxivlens breakdown helped me parse the method details (https://arxivlens.com/PaperView/Details/sae-interventions-are-unreliable-post-intervention-recovery-of-suppressed-behavior-7110-28c6f439). for practitioners, this argues for multi-layer defenses and explicit residual-channel monitoring rather than single-feature clamps. i’d be curious how this plays with other bottleneck ideas, like joint regularization across layers or training-time objectives to shrink the viable recovery directions. the diagnostic framing—treating post-intervention recovery as a probe—feels like a useful addition to the toolkit for stress-testing defenses before deployment.
