Mechanistic interpretability aims to reverse-engineer the weights of neural networks into human-understandable programs.
- Zoom In - An introduction to the Circuits thread, which takes a deep dive into InceptionV1, an ImageNet classifier. Provides high-level motivation and examples.
- Feature Visualization - A method for interpreting a neuron by optimizing an input image to maximize that neuron's activation.
- Building Blocks - Explores interfaces that combine feature visualization with attribution, which tries to identify which inputs or earlier neurons are responsible for a given output or a later neuron's activation.
- Multimodal Neurons - An exploration of CLIP neurons, which correspond to a remarkable array of concepts.
- A Mathematical Framework for Transformer Circuits - A preliminary framework for transformer interpretability, which helps give an intuition for what computations transformers are actually performing. There are accompanying exercises and videos.
- Induction Heads - A study of induction heads, a mechanism for copying text that appears early in language model training.
- SoLU - An activation function that improves language model interpretability. Includes some interesting examples of language model neuron interpretations.
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability - A much more detailed guide to getting started with transformer interpretability.
- Chris Olah's views on AGI safety - Some high-level thoughts on how interpretability could help ensure the safety of advanced AI systems.
See also: 200 Concrete Open Problems in Mechanistic Interpretability
Train a small CNN to do MNIST classification, and try your best to mechanistically understand how it manages to classify digits correctly (a minimal training sketch follows this list).
- Try feature visualization (sketched below), but note that it often doesn't yield recognizable visualizations for models trained on such a simple dataset.
- Try attribution (sketched below), which is more likely to help.
- Use whatever other techniques you like. Consider focusing on a small part of the network and trying to understand it as best you can.
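A minimal training sketch, assuming PyTorch and torchvision; the `SmallCNN` architecture and the hyperparameters are arbitrary illustrative choices, not part of the exercise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class SmallCNN(nn.Module):
    """Two conv layers and a linear readout -- small enough to inspect by hand."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.fc = nn.Linear(16 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 14x14 -> 7x7
        return self.fc(x.flatten(1))

def train(epochs: int = 1) -> SmallCNN:
    data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=128, shuffle=True)
    model = SmallCNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model
```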
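For the feature-visualization bullet, one bare-bones approach is gradient ascent on the input image against a chosen channel's mean activation. `visualize_channel` below is an illustrative helper, not a standard API; its lack of regularization is part of why the bullet warns this often fails on MNIST-scale models.

```python
def visualize_channel(model: nn.Module, layer: nn.Module, channel: int,
                      steps: int = 200, lr: float = 0.05) -> torch.Tensor:
    """Gradient ascent on a (sigmoid-squashed) input image to excite one channel."""
    img = torch.zeros(1, 1, 28, 28, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    for _ in range(steps):
        opt.zero_grad()
        model(torch.sigmoid(img))                     # sigmoid keeps pixels in [0, 1]
        (-acts["out"][0, channel].mean()).backward()  # maximize mean channel activation
        opt.step()
    handle.remove()
    return torch.sigmoid(img).detach()
```

Usage would look like `visualize_channel(model, model.conv2, channel=3)` for the model sketched above.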
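For the attribution bullet, the simplest starting point is gradient saliency: the gradient of the target logit with respect to the input pixels. `saliency` is again just an illustrative name; more robust methods (e.g. integrated gradients, occlusion) follow the same pattern.

```python
def saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """Per-pixel attribution |d logit_target / d pixel| for one MNIST image x of shape [1, 1, 28, 28]."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad[0, 0].abs()
```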
OR
Find the induction heads in GPT-2. Do this exercise if you want to prioritize transformer interpretability.
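A rough sketch of one common way to look for induction heads, assuming the TransformerLens library; the sequence length, token range, and score threshold are arbitrary illustrative choices. The idea: repeat a random token sequence twice and score each head by how much attention it pays from each token back to the token just after that token's earlier occurrence.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len, batch = 50, 4

# A random sequence repeated twice: induction heads should attend from each token
# in the second copy back to the token *after* its first occurrence.
rand = torch.randint(1000, 10000, (batch, seq_len))
bos = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long)
tokens = torch.cat([bos, rand, rand], dim=1).to(model.cfg.device)

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, query_pos, key_pos]
    # Attention from position q back to position q - (seq_len - 1) lies on this diagonal.
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores[layer] = diag.mean(dim=(0, -1))

for layer, head in (scores > 0.4).nonzero().tolist():
    print(f"candidate induction head: layer {layer}, head {head}, score {scores[layer, head]:.2f}")
```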