|LLM|INTERPRETABILITY|XAI|
Clear Waters: What an LLM Thinks Under the Surface
Anthropic’s Take on Decoding Abstract Features in Large Language Models

All things are subject to interpretation. Whichever interpretation prevails at a given time is a function of power and not truth. — Friedrich Nietzsche
Anthropic has recently been pushing hard on interpretability in Large Language Models. According to the company, neural networks learn features that are meaningful. The problem is being able to visualize them.
The transformer itself is difficult to interpret. Not that other neural networks fare much better, but the transformer has taken the black-box problem to another level. So from the very beginning, researchers looked for ways to interpret the model. Early approaches focused on visualizing attention heads.

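To make that early approach concrete, here is a minimal sketch of attention-head inspection, assuming the Hugging Face transformers library; the model name "gpt2" and the chosen layer and head are purely illustrative stand-ins, not the models or heads studied by Anthropic.

```python
# Sketch: extract and inspect per-head attention weights from a small model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = outputs.attentions
print(f"layers: {len(attn)}, heads per layer: {attn[0].shape[1]}")

# Pick one head (arbitrary choice, for illustration only) and ask:
# which earlier token does each token attend to most strongly?
layer, head = 5, 3
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
weights = attn[layer][0, head]          # (seq_len, seq_len)
for i, tok in enumerate(tokens):
    top = weights[i].argmax().item()    # most-attended source position
    print(f"{tok:>10} -> {tokens[top]}")
```

Even this tiny example hints at the problem the next paragraph raises: there are many layers and many heads, and the printed weights rarely map onto a single human-readable relationship.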
However, this approach is somewhat simplistic. Individual attention heads specialize in learning complicated relationships that are not easy to identify (deeper layers learn increasingly sophisticated relationships that are not immediately understandable). Moreover, modern models have dozens of attention heads per layer. Not to mention that the representation a head learns at a given layer depends on all the previous layers.
Opening the black box doesn’t necessarily help: the internal state of the model — what the model is “thinking” before writing its response — consists of a long list of numbers (“neuron activations”) without a clear meaning. (source)
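The quote is easy to verify in practice. The following sketch, again assuming the Hugging Face transformers library with "gpt2" as an illustrative stand-in, dumps part of a model's internal state for one token: what comes out is just a long, unlabeled vector of numbers.

```python
# Sketch: look at the raw "internal state" (hidden-state activations) of a model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("Interpretability is hard", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden-state tensor per layer (plus the embedding layer),
# each of shape (batch, seq_len, hidden_size).
hidden = outputs.hidden_states
last_token_state = hidden[-1][0, -1]   # final layer, last token
print(last_token_state.shape)          # e.g. torch.Size([768])
print(last_token_state[:10])           # ...just numbers, no obvious meaning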
An LLM can understand a large number of concepts, and these concepts must in some way be tied to its internal representation. The problem is trying to discern which concepts…