This is a Plain English Papers summary of a research paper called Not All Language Model Features Are Linear. If you like these kinds of analyses, you should subscribe to the AImodels.fyi newsletter or follow me on Twitter.
Overview
- This paper proposes that language models may use multi-dimensional representations, rather than just one-dimensional "features," to perform computations.
- The researchers develop a method to automatically find and analyze these multi-dimensional representations in large language models like GPT-2 and Mistral 7B.
- They identify specific examples of these multi-dimensional features, like circular representations of days of the week and months of the year, and show how the models use them to solve tasks involving modular arithmetic.
- The paper provides evidence that these circular features are fundamental to the models' computations on these tasks.
Plain English Explanation
The researchers behind this paper explored whether language models might use more complex, multi-dimensional representations of concepts, rather than just simple one-dimensional "features." They developed a way to automatically identify these multi-dimensional representations in large language models like GPT-2 and Mistral 7B.
One of the key findings was the discovery of circular representations for concepts like days of the week and months of the year. These circular features let the models efficiently perform computations involving modular arithmetic, like working out which day of the week comes a certain number of days after another. The researchers showed that these circular features were fundamental to the models' ability to solve these kinds of tasks, rather than an incidental byproduct.
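To make the "modular arithmetic" part concrete, here is a tiny, purely illustrative Python example of the kind of wrap-around computation involved (not code from the paper):

```python
# Illustrative only: day-of-week arithmetic wraps around a 7-element cycle,
# i.e. it is addition modulo 7.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_after(start: str, offset: int) -> str:
    # Move `offset` days forward from `start`, wrapping past Sunday.
    return days[(days.index(start) + offset) % 7]

print(day_after("Fri", 3))  # -> "Mon"
```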
This suggests that language models may implement more sophisticated "cognitive-like" representations and computations, rather than just simple one-dimensional feature manipulation as proposed by the linear representation hypothesis. It also raises interesting questions about the inherent biases and limitations of how language models represent and reason about the world.
Technical Explanation
The core idea of this paper is to challenge the linear representation hypothesis, which proposes that language models perform computations by manipulating one-dimensional representations of concepts (called "features"). Instead, the researchers explore whether some language model representations may be inherently multi-dimensional.
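For context, here is a minimal sketch of what a one-dimensional feature looks like under the linear representation hypothesis: a single direction in activation space, read out with a dot product. The hidden size and tensors below are made up for illustration.

```python
import torch

# Under the linear representation hypothesis, a "feature" is one direction
# in activation space; its strength is read off with a dot product.
d_model = 768  # illustrative hidden size, not tied to any specific model
feature_direction = torch.randn(d_model)
feature_direction /= feature_direction.norm()   # unit-norm feature direction

hidden_state = torch.randn(d_model)             # some residual-stream activation
feature_value = hidden_state @ feature_direction  # scalar "feature activation"
```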
To do this, they first develop a rigorous definition of "irreducible" multi-dimensional features - ones that cannot be decomposed into either independent or non-co-occurring lower-dimensional features. Armed with this definition, they design a scalable method using sparse autoencoders to automatically identify multi-dimensional features in large language models like GPT-2 and Mistral 7B.
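As a rough sketch of the sparse-autoencoder part of the pipeline (my simplification, not the authors' implementation; the hidden width and L1 coefficient are illustrative choices):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (illustrative sketch)."""
    def __init__(self, d_model: int = 768, d_hidden: int = 16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the activation
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

Roughly speaking, the paper then groups the learned dictionary elements into clusters and checks whether their joint activations trace out irreducible low-dimensional shapes, rather than treating each latent as an independent one-dimensional feature.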
Using this approach, the researchers identify some striking examples of interpretable multi-dimensional features, such as circular representations of days of the week and months of the year. They then show that the models use these same circular features to solve computational problems involving modular arithmetic over days and months, such as computing which day comes a given number of days after another.
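For intuition about why a circle is the right shape here: if each day sits at an angle on a 2D circle, then "adding days" becomes a rotation, and the wrap-around of modular arithmetic falls out automatically. The sketch below is illustrative and is not the learned representation itself.

```python
import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def embed(i: int) -> np.ndarray:
    # Place day i at angle 2*pi*i/7 on the unit circle.
    theta = 2 * np.pi * i / 7
    return np.array([np.cos(theta), np.sin(theta)])

def add_days(point: np.ndarray, k: int) -> np.ndarray:
    # Adding k days corresponds to rotating the point by 2*pi*k/7.
    phi = 2 * np.pi * k / 7
    rot = np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])
    return rot @ point

def decode(point: np.ndarray) -> str:
    # Read out the nearest day by angle on the circle.
    angle = np.arctan2(point[1], point[0]) % (2 * np.pi)
    return days[int(round(angle / (2 * np.pi / 7))) % 7]

print(decode(add_days(embed(4), 3)))  # 3 days after Fri -> "Mon"
```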
Finally, the paper provides evidence that these circular features are indeed the fundamental unit of computation for these tasks. The researchers conduct intervention experiments on Mistral 7B and Llama 3 8B that demonstrate the causal importance of the circular representations, and they further decompose the hidden states for these tasks into interpretable components, revealing more instances of these circular features.
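A hedged sketch of what such an intervention could look like: project the hidden state off a (hypothetical) 2D circular subspace and patch in the point corresponding to a different day, then check whether the model's answer changes accordingly. The function and variable names below are placeholders, not the authors' code.

```python
import torch

def intervene_on_circle(hidden: torch.Tensor,
                        circle_basis: torch.Tensor,
                        target_point: torch.Tensor) -> torch.Tensor:
    """Replace a hidden state's component in a 2D "circular" subspace.

    hidden:       (d_model,) activation at some layer/token position
    circle_basis: (d_model, 2) orthonormal basis spanning the circular subspace
    target_point: (2,) circle coordinates of the day we want to patch in

    Illustrative activation-patching sketch, not the paper's exact procedure.
    """
    coords = circle_basis.T @ hidden                     # current position on the circle
    hidden_off_circle = hidden - circle_basis @ coords   # remove the circular component
    return hidden_off_circle + circle_basis @ target_point  # patch in the new day
```

If the model's prediction then flips to the day implied by `target_point`, that is evidence the circular subspace is causally involved in the computation, which is the style of argument the paper makes.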
Critical Analysis
The paper makes a compelling case that at least some language models employ multi-dimensional representations that go beyond the simple one-dimensional "features" proposed by the linear representation hypothesis. The discovery of the interpretable circular features for days and months, and the evidence that these are central to the models' computations, is a significant finding.
However, the paper does not address the broader limitations and biases inherent in how language models represent and reason about the world. While the multi-dimensional features may be more sophisticated, they may still suffer from systematic biases and blind spots in their understanding.
Additionally, the paper focuses on a relatively narrow set of tasks and model architectures. It remains to be seen whether these findings generalize to a wider range of language models and applications. Further research is needed to understand how widespread these multi-dimensional representations are and what role they play in models' broader reasoning capabilities.
Conclusion
This paper challenges the prevailing view that language models rely solely on one-dimensional feature representations. Instead, it provides compelling evidence that at least some models employ more sophisticated, multi-dimensional representations to perform computations. The discovery of interpretable circular features for concepts like days and months, and their central role in solving relevant tasks, is a significant advancement in our understanding of language model representations and capabilities.
While this research raises interesting questions about the cognitive-like nature of language model representations, it also highlights the need for continued critical analysis and exploration of their limitations and biases. Ultimately, this work contributes to our evolving understanding of how large language models work and their potential implications for artificial intelligence.
If you enjoyed this summary, consider subscribing to the AImodels.fyi newsletter or following me on Twitter for more AI and machine learning content.