Conversation

adamkarvonen (Collaborator)

When training SAEs on Qwen models, I found that as we move to later layers and larger models (e.g. Qwen3-32B), the activations begin to contain random attention-sink-like outliers with extremely high norms (>100x the median). They appear at random positions in the sequence, often on seemingly unimportant tokens such as the `0` in `my_list.append(0)`.

This reduces the fraction of variance explained by around 3% and adds loss spikes. It also produces a significant number of dead features early in training (often 25% or more), although this does seem to go away after 100M tokens or so.

To address this, I added an optional argument; when it is set, we filter out activations whose norm exceeds max_activation_norm_multiple times the median activation norm.
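
For reference, a minimal sketch of the filtering logic, assuming activations arrive as a `(batch, d_model)` tensor; the function name and call site here are illustrative, not the exact implementation in this PR:

```python
import torch

def filter_high_norm_activations(
    activations: torch.Tensor,
    max_activation_norm_multiple: float,
) -> torch.Tensor:
    """Drop activations whose L2 norm exceeds
    max_activation_norm_multiple * (median activation norm).

    activations: (batch, d_model) tensor of model activations.
    """
    norms = activations.norm(dim=-1)                       # (batch,)
    median_norm = norms.median()
    keep_mask = norms <= max_activation_norm_multiple * median_norm
    return activations[keep_mask]
```

In practice this runs on each activation batch before it is fed to the SAE, so the outlier tokens never contribute to the reconstruction loss.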

As an example, in the plot below the orange and green lines each show multiple high-norm activations within a single sequence.

[Screenshot (2025-08-20): per-token activation norms across sequences; the orange and green lines each show multiple high-norm spikes within a single sequence]

adamkarvonen merged commit 60ec6bf into main on Aug 21, 2025
3 checks passed