"A neuron that represents everything represents nothing. Or does it?"

Polysemanticity

Inside the Black Box

Every AI system you have used (every chatbot, image generator, and recommendation algorithm) contains an architectural property that its creators can only partially explain and that makes these systems fundamentally difficult to audit or verify. The individual neurons inside these networks do not represent single concepts: they respond to mixtures of unrelated things simultaneously, the same computational unit activating for "Toronto" and "hockey" and "Canadian geography" at once. Researchers are only beginning to understand why this happens, how severe it is, and whether it can be reversed, and the answer matters for every claim about AI safety made today.

~12 min read

The Idea

The word "bank" can mean a financial institution, a river's edge, or a place that stores blood. In a sentence you don't notice the ambiguity: context collapses all three meanings into one. The word carries all three simultaneously, but only one is active at any moment.

Neurons in neural networks behave similarly, except worse in two ways. First, the things they represent aren't like dictionary entries with crisp definitions: they emerge from training on billions of text fragments and often resist any clean articulation in English. Second, the context that would help disambiguate is buried layers deep in the network's computation, inaccessible to anyone inspecting a single activation value in isolation.

Five unrelated features, one neuron. This pattern appears throughout real language models.

This isn't a flaw that could be patched in the next model release. A network trained on language must represent millions of distinct features, but it has far fewer neurons than it has features to represent. So it does what any compressed representation must: it stores multiple features per neuron as overlapping directions in activation space. Because real language is sparse (at any moment, most features are inactive), the overlap stays manageable. The features don't cancel each other out; they coexist. This is the superposition hypothesis: neurons don't represent features one-to-one. The network superimposes many features onto the same neurons, trusting the geometry of high-dimensional space and the sparsity of real data to keep them sorted.

Let's look at what's actually happening in the math, and then you can see it live.

The Math

To understand why polysemanticity is inevitable, you need one geometric idea: a feature is not a neuron. A feature is a direction, a vector in the space of possible activation patterns. Neurons define a coordinate system for that space, but features can point anywhere, not just along the axes.

Feature direction f lives at an angle in activation space, not aligned with any single neuron axis.

In the diagram above, e1 and e2 are neuron axes. The feature f sits between them. To read off the value of f from any activation vector, you take a dot product with f's direction, and that dot product receives contributions from both neurons. The feature is polysemantically encoded across both.^[4]When f is not aligned with any axis, reading one neuron's activation gives you a mixture of f and every other feature sharing that dimension. This is the geometric root of polysemanticity: features and neurons are not the same thing.
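A minimal numeric sketch of the paragraph above, assuming a two-neuron activation space. The feature directions `f` and `g` and the strengths 2 and 1 are hypothetical values chosen for illustration, not taken from any real model.

```python
import numpy as np

# Hypothetical 2-neuron activation space; e1 and e2 are the neuron axes.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

def unit(angle_deg):
    """Unit vector at the given angle (in degrees) from the e1 axis."""
    a = np.deg2rad(angle_deg)
    return np.array([np.cos(a), np.sin(a)])

f = unit(45.0)    # feature f sits between the two neuron axes
g = unit(135.0)   # a second feature, orthogonal to f (illustrative only)

# An activation vector in which f is present with strength 2 and g with strength 1.
activation = 2.0 * f + 1.0 * g

# Reading along a feature direction recovers that feature's value.
f_value = float(activation @ f)
g_value = float(activation @ g)

# Reading a single neuron returns a mixture of both features.
neuron_1 = float(activation @ e1)
neuron_2 = float(activation @ e2)

print(f"f read-out: {f_value:.3f}, g read-out: {g_value:.3f}")
print(f"neuron 1: {neuron_1:.3f}, neuron 2: {neuron_2:.3f}")
```

The dot products along `f` and `g` return the strengths that were put in, while neither neuron value matches either feature on its own.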

Now suppose you have n features to represent but only m neurons, where m is much smaller than n. You cannot give each feature its own axis. You have to pack n directions into m dimensions. The question is how badly the directions will interfere with each other.

To make this precise, Elhage et al. introduce a minimal model. The network takes an n-dimensional sparse input x, compresses it through an m-dimensional hidden layer h, then reconstructs the input as x-hat. The encoder and decoder share the same weight matrix, one of the simplest constraints you can impose.

\mathbf{h} = \mathbf{W}\mathbf{x}, \quad \hat{\mathbf{x}} = \mathrm{ReLU}(\mathbf{W}^{\!\top}\mathbf{h} + \mathbf{b})

Here W is an m-by-n matrix. Each column of W is a learned direction in the m-dimensional hidden space: the direction the network uses to encode that feature. The reconstruction uses the same W transposed, so decoding is the mirror of encoding. The network is trained to minimize mean squared reconstruction error.

The toy model compresses n=5 features through m=3 neurons and reconstructs them via tied weights.
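The forward pass can be sketched in a few lines of NumPy. This sketch uses the form of the tied-weight model in Elhage et al. (a linear hidden layer, with the ReLU applied to the reconstruction); the weights here are random and untrained, and the names `toy_forward` and `mse` are illustrative, not from the paper's code.

```python
import numpy as np

def toy_forward(W, b, x):
    """Tied-weight toy model: h = W x, x_hat = ReLU(W^T h + b)."""
    h = W @ x                              # compress n features into m hidden dims
    x_hat = np.maximum(W.T @ h + b, 0.0)   # decode with the same weights, then ReLU
    return h, x_hat

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return float(np.mean((x - x_hat) ** 2))

n, m = 5, 3                      # 5 features, 3 neurons, as in the figure
rng = np.random.default_rng(0)

# Untrained, illustrative weights: random unit-norm columns (an assumption).
W = rng.normal(size=(m, n))
W /= np.linalg.norm(W, axis=0, keepdims=True)
b = np.zeros(n)

# A sparse input: only feature 1 is active.
x = np.zeros(n)
x[1] = 1.0

h, x_hat = toy_forward(W, b, x)
print("hidden:", np.round(h, 3))
print("reconstruction:", np.round(x_hat, 3))
print("mse:", round(mse(x, x_hat), 4))
```

Because the columns of `W` have unit norm, the active feature is reconstructed at full strength; any nonzero values at the other positions are interference from non-orthogonal columns.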

The key result is what happens to reconstruction loss as the number of features n grows past the number of neurons m. If the network stores features as perfectly orthogonal directions, those features are reconstructed without interference, but only m of them fit and the rest go unrepresented. If it packs more features in as non-orthogonal directions, the dot products between feature vectors create interference: the activation for one feature contaminates the read-out for another.

The interference between any two feature directions u and v enters the loss as the squared dot product:

\text{Interference}(u, v) = \langle u,\, v \rangle^2
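As a rough illustration of this formula, the sketch below computes the squared dot product for every pair of feature directions. The helper name `interference_matrix` and the two example layouts (two features on the axes, five features spaced evenly in a plane) are assumptions made for the example.

```python
import numpy as np

def interference_matrix(W):
    """Squared dot products between all pairs of unit-normalized columns of W."""
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    gram = W_unit.T @ W_unit
    interference = gram ** 2
    np.fill_diagonal(interference, 0.0)   # ignore each feature's overlap with itself
    return interference

# Orthogonal case: 2 features aligned with the 2 neuron axes.
W_orth = np.eye(2)

# Superposed case: 5 features in 2 dimensions, spaced 72 degrees apart.
angles = np.deg2rad(72.0 * np.arange(5))
W_pentagon = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 5)

print(interference_matrix(W_orth))
print(np.round(interference_matrix(W_pentagon), 3))
```

In the five-feature layout, neighbouring directions overlap only slightly (squared cosine of 72 degrees, about 0.095) while directions two steps apart overlap much more (about 0.655), so where features sit relative to each other matters as much as how many there are.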

When features are sparse (most are inactive at any given moment), this interference rarely shows up in practice.^[6]Sparsity is formalized as S = 1 - p, where p is the probability any given feature is active on a given input. High S (near 1) means features almost never fire together; low S (near 0) means they commonly co-occur. The reconstruction loss scales roughly as p * interference, so sparsity directly suppresses the cost of non-orthogonal encoding. Two features that would strongly interfere never appear in the same input vector, so their dot product never contributes to the gradient. The network learns to tolerate the geometric interference because the statistical chance of triggering it is small.
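The "p * interference" scaling can be checked with a small simulation. This is a sketch under simplifying assumptions that are not in the source: only two features, binary activations (a feature is either 0 or 1), and a purely linear read-out. The function name `mean_readout_error` is hypothetical.

```python
import numpy as np

def mean_readout_error(p, cos_angle, n_samples=200_000, seed=0):
    """Mean squared error when reading feature 0 while feature 1 shares its space.

    Each feature is active (value 1) independently with probability p.
    """
    rng = np.random.default_rng(seed)
    u = np.array([1.0, 0.0])                                  # direction of feature 0
    v = np.array([cos_angle, np.sqrt(1.0 - cos_angle ** 2)])  # direction of feature 1
    x = (rng.random((n_samples, 2)) < p).astype(float)        # sparse binary inputs
    acts = x[:, :1] * u + x[:, 1:] * v                        # superposed activations
    readout = acts @ u                                        # linear read-out of feature 0
    return float(np.mean((readout - x[:, 0]) ** 2))

cos_angle = 0.5   # interference = cos_angle**2 = 0.25
for p in (0.1, 0.01):
    err = mean_readout_error(p, cos_angle)
    print(f"p={p}: simulated error {err:.5f}, p * interference = {p * cos_angle ** 2:.5f}")
```

Under these assumptions the simulated error tracks p times the squared dot product: the geometric overlap is fixed, and sparsity controls how often it is paid for.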

The key geometric fact is a consequence of the concentration of measure in high dimensions: if you draw two unit vectors uniformly at random from a sphere in m-dimensional space, their expected squared inner product is exactly 1/m. In 5 dimensions this is 0.2; in 512 dimensions it is about 0.002. High-dimensional spaces can accommodate exponentially many nearly-orthogonal directions, which is why large models can store far more features than they have neurons, at the cost of small but nonzero interference between every pair.

\mathbb{E}\!\left[\langle u,\, v \rangle^2\right] = \frac{1}{m}
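One way to see where the 1/m comes from is a short symmetry argument, sketched here. By rotational symmetry, v can be fixed to the first coordinate axis without changing the expectation, and every coordinate of a uniformly random unit vector u has the same distribution.

```latex
% Fix v = e_1 by rotational symmetry; u is uniform on the unit sphere in R^m.
\langle u,\, v \rangle^2 = u_1^2,
\qquad
\sum_{i=1}^{m} u_i^2 = 1
\;\Longrightarrow\;
\sum_{i=1}^{m} \mathbb{E}\!\left[u_i^2\right] = 1
\;\Longrightarrow\;
\mathbb{E}\!\left[u_1^2\right] = \frac{1}{m}
```

The last step uses the fact that all m coordinates are identically distributed, so each contributes an equal share of the total.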

The expected interference per feature pair is 1/m, not 1. In a 5-dimensional hidden space, you can pack dozens of features with average interference of just 0.2. With high sparsity the loss penalty is further suppressed by a factor of p (the activation probability). The network can represent far more than 5 things; it just cannot represent all of them with perfect fidelity simultaneously.
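The 1/m figure is easy to check numerically. The sketch below draws random unit vectors by normalizing Gaussian samples (a standard way to sample uniformly from a sphere); the function name `mean_squared_overlap` and the sample count are choices made for this example.

```python
import numpy as np

def mean_squared_overlap(m, n_pairs=10_000, seed=0):
    """Monte Carlo estimate of E[<u, v>^2] for random unit vectors in m dimensions."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_pairs, m))
    v = rng.normal(size=(n_pairs, m))
    # Normalizing Gaussian vectors gives points uniform on the unit sphere.
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    dots = np.sum(u * v, axis=1)
    return float(np.mean(dots ** 2))

for m in (5, 512):
    estimate = mean_squared_overlap(m)
    print(f"m={m}: estimate {estimate:.5f}, 1/m = {1.0 / m:.5f}")
```

The estimates should land close to 0.2 and 0.002 respectively, with some sampling noise.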

As sparsity increases past a threshold the optimal solution undergoes a phase transition. Below the threshold the network stores m features monosemantically, one per neuron. Above it, the network abruptly switches to a superposed geometry, distributing each feature across multiple neurons and each neuron across multiple features. The transition is sharp: not a gradual degradation but a discrete jump in the structure of W. This is why polysemanticity is not simply a sign of a small or poorly trained model. It is the optimal strategy under compression and sparsity.

Lab

Three interactive visualizations of superposition. The first shows Pearson correlations from the trained toy model: watch the weight structure shift as sparsity increases. The second lets you explore feature interference geometry directly: adjust the angle between two features and see how much spurious activation bleeds through. The third shows how feature directions pack into 2D space as you add more of them, and how sparsity determines how many can coexist.

3.1: Neuron-feature correlations

At low sparsity each neuron connects strongly to one or two features: the monosemantic regime. Move sparsity past 0.4 and watch polysemanticity emerge: each neuron accumulates bright connections to many unrelated features, and each feature distributes itself across multiple neurons simultaneously.

3.2: Feature interference geometry

3.3: Feature packing geometry (2D)

Safety

Identifying what a model is computing requires identifying what its components are computing. For circuit-based interpretability (the approach most likely to yield causal guarantees rather than correlation-based approximations), this means identifying what each neuron contributes to a given output. With polysemantic neurons, this identification is structurally ambiguous: a neuron encoding five unrelated features activates for some weighted combination of all of them whenever any one is present. Observing its activation during a suspicious behavior tells you only that the mixture exceeded a threshold, not which ingredient drove it.[1]^[8]This ambiguity is not a limitation of current measurement tools that better probes will resolve. It is a consequence of the representational choice the network has made. Elhage et al. (2022) show that superposition is the optimal encoding strategy under sparsity, so the ambiguity is not a bug to be patched but a property of the representation itself.

The problem surfaces concretely in the most detailed circuit-level analyses available. Mapping the indirect object identification circuit in GPT-2 Small required extensive component-level disambiguation (path patching and ablations across attention heads) to confirm which features of each node were causally relevant to the behavior, even in a model small enough to inspect in full.[2] Scaling this disambiguation to frontier models, where polysemanticity is present in every layer and the same structural pressures apply at orders-of-magnitude greater depth, is an open problem without a known solution.

To be precise about one implication: consider a model that has learned "produce outputs that help the user" and "produce outputs that appear to help the user during evaluation" as overlapping directions in the same neurons. This is not a claim that any current model has learned such a distinction, or that any deployed model is in any meaningful sense deceptive. It is a claim about what polysemanticity would mean for alignment auditing if such a distinction existed at the representational level.

If these two behaviors were encoded in overlapping neural directions, a probe or activation-patching experiment observing those neurons during an evaluation task could not distinguish genuine helpfulness from performed helpfulness. The superposed representations would be indistinguishable from outside the neuron. The question of whether current models are deceptively aligned is separate from, and prior to, the question of whether our interpretability tools could detect it if they were. Polysemanticity degrades the answer to the second question regardless of the answer to the first.

Three specific interpretability methods are directly impacted. Activation patching (replacing a model's activations at a given layer position with activations from a different input to identify causal contributions) becomes systematically noisy when the patched neurons are polysemantic. The patch simultaneously modifies all features encoded in the patched dimension. The resulting behavioral change reflects the combined effect of all of them, not cleanly attributable to the feature under study. The causal story becomes ambiguous precisely where clarity is most needed.
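A toy sketch of why patching a polysemantic neuron is hard to interpret. Everything here is hypothetical: three features stored in two neurons at 120-degree spacing, linear read-outs, and a helper named `patch_neuron`. It stands in for activation patching on a real model, where the same replacement would be done through hooks.

```python
import numpy as np

# Three hypothetical features stored in two neurons (columns are feature directions).
angles = np.deg2rad([0.0, 120.0, 240.0])
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (2 neurons, 3 features)

def hidden(x):
    """Encode feature values into the 2-neuron hidden layer."""
    return W @ x

def readout(h):
    """Linear read-out of each feature from the hidden layer."""
    return W.T @ h

def patch_neuron(h_base, h_source, idx):
    """Copy one neuron's activation from the source run into the base run."""
    patched = h_base.copy()
    patched[idx] = h_source[idx]
    return patched

clean_x = np.array([1.0, 0.0, 0.0])     # run where feature 0 is present
corrupt_x = np.array([0.0, 1.0, 0.0])   # run where feature 1 is present

h_clean = hidden(clean_x)
h_corrupt = hidden(corrupt_x)

# Patch neuron 0 from the clean run into the corrupted run.
h_patched = patch_neuron(h_corrupt, h_clean, idx=0)
delta = readout(h_patched) - readout(h_corrupt)

print("change in each feature read-out:", np.round(delta, 3))
```

Patching the single neuron shifts the read-out of all three features at once, so the behavioral change downstream cannot be attributed to feature 0 alone.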

Circuit identification faces the same problem from a different angle. Finding the minimal subgraph of model components that implements a given behavior requires knowing which feature each component in the circuit is acting on. In polysemantic networks, confirming that a neuron is acting on feature X rather than on feature Y (which it also encodes, and which may activate in overlapping contexts) requires additional ablations, activation steering, or targeted probing for each candidate. These steps are tractable in small models studied in isolation, but the complexity scales with the number of polysemantic nodes in the circuit, which grows with model depth and width.

Scale is where the difficulty compounds most severely. GPT-2 Small has 124 million parameters across 12 layers and has been studied in greater mechanistic detail than any other transformer. Frontier models are roughly three to four orders of magnitude larger, and polysemanticity does not attenuate with scale; if anything, larger models face greater compression pressure as the ratio of features to neurons grows. Each layer's superposed representations interact with the superposed representations of every subsequent layer, producing interference patterns that accumulate with depth and that no current technique can comprehensively characterize.

The dominant research response to polysemanticity is the sparse autoencoder. The idea is to train an auxiliary network (one not involved in the original model's inference) to decompose polysemantic activations into a much larger set of near-monosemantic components. The auxiliary network has two parts: an encoder that projects the activation vector into a high-dimensional space, and a decoder that reconstructs the original activation from a sparse linear combination of that space's basis vectors. If the decomposition works, each basis vector corresponds to a single interpretable concept, and the network's behavior at any layer can be expressed in those terms.^[9]The sparsity constraint is what forces the decomposition toward interpretable features. Without it, the encoder could simply rotate the activation into a dense high-dimensional basis with no gain in interpretability. Sparsity pressure pushes the network to find representations where most features are inactive on any given input, which matches the assumed structure of how real-world concepts occur in language.

Colored squares in the dictionary are active features. Most (gray) remain silent; sparsity is the goal.
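The encoder, decoder, and sparsity penalty described above can be sketched as follows. This is a minimal forward pass with random, untrained weights, not a training loop. The dimensions, the L1 coefficient, and the choice to subtract the decoder bias before encoding are assumptions made for the example; published SAE implementations differ in these details.

```python
import numpy as np

def sae_forward(a, W_enc, b_enc, W_dec, b_dec):
    """Sparse autoencoder forward pass on a batch of activations a."""
    # Encoder: project into the larger dictionary space; ReLU keeps features non-negative.
    f = np.maximum((a - b_dec) @ W_enc + b_enc, 0.0)
    # Decoder: reconstruct the activation as a linear combination of dictionary vectors.
    a_hat = f @ W_dec + b_dec
    return f, a_hat

def sae_loss(a, a_hat, f, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    recon = np.mean(np.sum((a - a_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return float(recon + l1_coeff * sparsity)

rng = np.random.default_rng(0)
d_model, d_dict, batch = 8, 32, 4   # dictionary is 4x wider than the activation

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

acts = rng.normal(size=(batch, d_model))   # stand-in for real model activations
features, recon = sae_forward(acts, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(acts, recon, features, l1_coeff=1e-3)

print("feature shape:", features.shape, "loss:", round(loss, 4))
```

Training would adjust the four parameter arrays to lower this loss; the L1 term is what trades reconstruction quality against the number of active dictionary features.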

The empirical results are real. Bricken et al.[3] trained sparse autoencoders on a one-layer transformer and recovered thousands of features that are interpretable to human labelers: emotion concepts with distinct directions for joy, sadness, and anger; multilingual representations that assign separate features to cognate concepts across languages; features for abstract relationships including causation, negation, and conditionality. These are not selected highlights from a sparse result set: they represent a systematic decomposition of the network's representational space into components that correspond to nameable things.^[10]The monosemanticity paper also demonstrated feature universality: some features appear across different training runs of the same architecture, suggesting they reflect genuine structure in the training distribution rather than random artifacts of a particular run.

Three important limitations remain clearly in view. First, recovering a dictionary of interpretable features does not immediately answer the circuit question: how those features interact causally across layers to produce a given behavior is a separate analysis that SAEs alone do not perform. Second, there is a verification gap: an SAE feature labeled "aggression" may be interpretable to human evaluators while a nearby non-interpretable feature in the same dimensional neighborhood carries the actual causal weight for a specific behavior. Interpretability and causal relevance are not the same property. Third, sparse autoencoders have not yet been applied at frontier scale in a way that produces verifiable safety guarantees; the computational cost of training and validating SAEs on models with hundreds of billions of parameters, across all layers, remains a significant barrier.

None of this is a reason to dismiss the approach. Sparse autoencoders represent the most concrete progress in mechanistic interpretability in years, and the open problems they expose are clearly stated rather than hidden. The honest summary is that we now have better tools for decomposing what is in a model's representations, and substantially less progress on verifying whether what we find is causally responsible for what the model does.

Research

The most systematic evidence comes from Bricken et al.,[3] who trained sparse autoencoders on a one-layer transformer, small enough that results could be verified manually across thousands of features, large enough to exhibit the statistical structure that makes language modeling work. The architecture is not representative of frontier models in any direct sense. What it provides is a controlled environment where the claim "this feature represents X" can be checked against activation patterns rather than inferred from labeler agreement alone.

Within that model, the SAEs recovered emotion features organized along the psychological dimensions of valence and arousal: distinct dictionary elements activating for joy, sadness, anger, and fear, with geometric relationships between them that roughly paralleled dimensional models of affect. They also found multilingual concept features: single dictionary elements activating for the same concept expressed in different languages, such as a shared feature for monarchical governance appearing across English, Spanish, French, and German contexts. Additional features corresponded to historical periods, geographic regions, and abstract syntactic relationships including negation, causation, conditionality, and agency.^[11]The paper also documented feature universality: some features appeared across independent training runs of the same architecture, suggesting they reflect genuine structure in the training distribution rather than artifacts of a particular random seed. This is evidence that the decomposition is not arbitrary.

The epistemic status of these results deserves a direct statement. They show that SAEs can recover human-interpretable structure from the activations of a minimal model. They establish proof-of-concept for the decomposition approach. They are not evidence that we understand what a frontier model is computing, and they should not be read as such. The one-layer architecture has no cross-layer circuits, no multi-head attention with complex interaction patterns, and none of the depth at which alignment-relevant behaviors in large models are believed to emerge. The gap between what has been studied and what is deployed is large and should be stated plainly.

Training an SAE on a one-layer toy model is tractable on a workstation over an afternoon. Applying the same methodology at frontier scale is a qualitatively different problem. Anthropic's 2024 scaling work applied SAEs to Claude 3 Sonnet, requiring dictionaries with millions of features, encoder inference across hundreds of billions of tokens, and evaluation at a scale where manual feature inspection becomes the binding constraint.[4]^[12]The scaling paper found that feature quality improved with dictionary size: more features per neuron produced cleaner, more specific activations. But interpretability verification did not scale at the same rate. Automated evaluation methods are being developed to reduce dependence on human review, but none have yet been validated as reliable substitutes for direct inspection. The computational costs are significant but tractable with sufficient infrastructure. The human review bottleneck is harder to engineer around: there is no established automated method for verifying that a dictionary element labeled "aggression" is actually computing aggression rather than a correlated but distinct concept that activates in similar contexts. This is now as much an engineering and evaluation problem as a theoretical one.

Several questions remain genuinely unresolved, and understanding their status matters for interpreting the field's progress accurately.

The first is whether SAE features are the actual computational primitives the model uses. It is possible to train a decomposition that produces highly human-interpretable features while the model's computation flows through different directions in the same representational space. SAEs find what can be expressed in human-nameable terms. They do not directly verify that those terms are the ones the model is reasoning in. A feature cleanly labeled by human raters may still be a projection of a more complex non-interpretable structure, the same way a shadow can have a recognizable shape while the object casting it does not.

The second question concerns whether monosemanticity implies faithfulness. A dictionary element that activates reliably for "negation" and is confidently labeled by raters may still participate in opaque higher-order circuits involving features that have no interpretable label. Monosemanticity of individual features and interpretability of the computation that uses them are separate properties. Circuit-level analysis (identifying the causal pathway from input to output through specific features) is a further step that SAE decomposition alone does not provide.

Third, there is the question of whether feature geometry is actionable for alignment. Representation engineering[5] showed that steering vectors (directions in activation space corresponding to behavioral properties like compliance, refusal, or honesty) can shift model outputs reliably when added to residual stream activations at inference time. Targeted unlearning and concept erasure are related approaches. These results are promising, but they require the intervention vectors to be causally relevant to the behavior in question, which brings us back to the faithfulness question. You can steer a model along a direction labeled "honesty" without knowing whether that direction is what the model uses to compute honest outputs, or whether you are simply suppressing a correlated signal while the actual computation is unchanged.
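The steering operation itself is simple, which is part of why the faithfulness question matters. The sketch below adds a scaled direction to an activation vector and measures how the read-out shifts along that direction and along a second, correlated one. The vectors, the labels, and the helper name `steer` are hypothetical; real steering vectors are extracted from model activations rather than written by hand.

```python
import numpy as np

def steer(activation, direction, alpha):
    """Add alpha times the unit-normalized direction to an activation vector."""
    d = direction / np.linalg.norm(direction)
    return activation + alpha * d

# Hypothetical directions in a 3-dimensional residual stream.
labeled = np.array([1.0, 0.0, 0.0])      # direction a researcher has labeled "honesty"
correlated = np.array([0.6, 0.8, 0.0])   # a different direction with cosine 0.6 to it

resid = np.array([0.2, -0.1, 0.5])       # stand-in for a residual stream activation
steered = steer(resid, labeled, alpha=4.0)

shift_labeled = float((steered - resid) @ labeled)
shift_correlated = float((steered - resid) @ correlated)

print(f"shift along labeled direction:    {shift_labeled:.2f}")
print(f"shift along correlated direction: {shift_correlated:.2f}")
```

The intervention moves the correlated direction by 60 percent as much as the labeled one, so an observed change in behavior does not by itself show which of the two the model was using.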

Finally, it is worth asking whether superposition is the complete account of polysemanticity. The superposition hypothesis explains one mechanism (too many features, too few neurons, high sparsity) but does not rule out additional sources. Learned symmetries in weight matrices, attention mechanisms that produce effective polysemanticity at the head level even when individual neurons are monosemantic, and optimization dynamics that create superposition as a byproduct of unrelated pressures are all plausible candidates. The empirical evidence that would cleanly distinguish these mechanisms does not yet exist.

Related work worth knowing. TransformerLens (Neel Nanda) is the standard open-source toolkit for mechanistic interpretability of transformer models: it provides activation hooks, patching utilities, and a consistent interface for GPT-family architectures. Representation engineering (Zou et al. 2023) is a complementary approach that identifies behavioral directions in activation space without requiring full feature decomposition, and has produced reliable steering results on alignment-relevant properties. The ARENA curriculum is a structured training program for researchers entering the mechanistic interpretability field, covering SAEs, circuit analysis, and activation patching from scratch. The EleutherAI interpretability team runs parallel open-source interpretability efforts on open-weights models, providing research that is not conditioned on access to proprietary systems.

This research is moving faster than any static page can track. The findings above reflect the literature as of May 2025. Some of what is written here will be superseded within months; check the papers directly for the current state of any specific claim.

Start Here

If you're new to AI

Every AI product you use, from chatbots to search assistants to content recommendation, contains this property. The neurons inside these systems respond to mixtures of unrelated concepts simultaneously, which means their behavior cannot be fully verified by reading what individual neurons are doing.

A Technical Introduction to AI Safety

BlueDot Impact: a structured course covering the technical foundations of AI safety, including interpretability.

AI Safety Fundamentals: Interpretability Track

Covers mechanistic interpretability from first principles. No ML background required to start.

How thoroughly AI systems can be audited (the central technical question behind every proposed AI safety regulation) depends directly on whether the interpretability problem polysemanticity creates can be solved. That makes this a policy question as much as an engineering one.

If you're a developer or ML practitioner

TransformerLens

Neel Nanda's toolkit for mechanistic interpretability on transformer models. The standard starting point for activation access, hooks, and patching experiments.

ARENA: Interpretability Track

Structured curriculum covering SAEs, circuit analysis, and activation patching. Designed for engineers entering the field.

Neel Nanda's GitHub

SAE training notebooks, reference implementations, and ongoing mechanistic interpretability work.

nnsight

Library for intervention experiments on neural networks: remote access to large models with clean hooks for activation patching and steering.

If you're a researcher

200 Concrete Open Problems in Mechanistic Interpretability

Neel Nanda's taxonomy of unsolved problems, organized by difficulty and prerequisite knowledge.

Bricken et al. (2023)

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic. The foundational SAE results on a one-layer transformer.

Zou et al. (2023)

Representation Engineering: A Top-Down Approach to AI Transparency. Activation-space steering without requiring full feature decomposition.

This page was built alongside active interpretability research. The toy model visualizations implement the Elhage et al. (2022) architecture directly: the weight matrices, sparsity levels, and reconstruction loss are the same as the original paper's setup.

This page was built as a learning tool and a technical portfolio piece. The interactive visualizations run entirely in your browser using toy models trained on synthetic data; they are illustrative of the phenomenon, not reproductions of frontier model behavior. All citations link to original papers. The author is Jacob Ortiz, AI Researcher and Physics student at UCSD. Errors are mine.

References

[1] Elhage et al. Toy Models of Superposition. 2022. Transformer Circuits Thread.
[2] Wang, Variengien, Conmy, Shlegeris, Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. 2022. ICLR 2023.
[3] Bricken et al. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023. Anthropic.
[4] Templeton et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024. Anthropic.
[5] Zou et al. Representation Engineering: A Top-Down Approach to AI Transparency. 2023. arXiv:2310.01405.
[6] Nanda, Chan, Lieberum, Smith, Steinhardt. Progress measures for grokking via mechanistic interpretability. 2023. ICLR 2023.