Mechanistic interpretability aims to reverse-engineer the weights of neural networks into human-understandable programs.
- Zoom In - An introduction to the Circuits thread, which takes a deep dive into InceptionV1, an ImageNet classifier. Provides high-level motivation and examples.
- Feature Visualization - A method for interpreting a neuron by optimizing an input image to maximize that neuron's activation.
- Building Blocks - Explores interfaces that combine feature visualization with attribution, which tries to identify which inputs or earlier neurons are responsible for a given output or a later neuron's activation.
- Multimodal Neurons - An exploration of CLIP neurons, which correspond to a remarkable array of concepts.
- A Mathematical Framework for Transformer Circuits - A preliminary framework for transformer interpretability, which helps give an intuition for what computations transformers are actually performing. There are accompanying exercises and videos.
- Induction Heads - A study of induction heads, a mechanism for copying text that appears early in language model training.
- SoLU - An activation function that improves language model interpretability. Includes some interesting examples of language model neuron interpretations.
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability - A much more detailed guide to getting started with transformer interpretability.
- Chris Olah's views on AGI safety - Some high-level thoughts on how interpretability could help ensure the safety of advanced AI systems.
See also: 200 Concrete Open Problems in Mechanistic Interpretability
Train a small CNN to do MNIST classification, and try your best to mechanistically understand how it manages to classify digits correctly (a minimal training sketch follows this list).
- Try feature visualization (sketched below), but note that it often doesn't yield recognizable visualizations for models trained on such a simple dataset.
- Try attribution (sketched below), which is more likely to help.
- Use whatever other techniques you like. Consider focusing on a small part of the network and trying to understand it as best you can.
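A minimal training sketch, assuming PyTorch and torchvision; the `SmallCNN` architecture and the hyperparameters are arbitrary illustrative choices, not part of the exercise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class SmallCNN(nn.Module):
    """Two conv layers and a linear readout -- small enough to inspect by hand."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.fc = nn.Linear(16 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 14x14 -> 7x7
        return self.fc(x.flatten(1))

def train(epochs: int = 1) -> SmallCNN:
    data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=128, shuffle=True)
    model = SmallCNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model
```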
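For the feature-visualization bullet, one bare-bones approach is gradient ascent on the input image against a chosen channel's mean activation. `visualize_channel` below is an illustrative helper, not a standard API; its lack of regularization is part of why the bullet warns this often fails on MNIST-scale models.

```python
def visualize_channel(model: nn.Module, layer: nn.Module, channel: int,
                      steps: int = 200, lr: float = 0.05) -> torch.Tensor:
    """Gradient ascent on a (sigmoid-squashed) input image to excite one channel."""
    img = torch.zeros(1, 1, 28, 28, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    for _ in range(steps):
        opt.zero_grad()
        model(torch.sigmoid(img))                     # sigmoid keeps pixels in [0, 1]
        (-acts["out"][0, channel].mean()).backward()  # maximize mean channel activation
        opt.step()
    handle.remove()
    return torch.sigmoid(img).detach()
```

Usage would look like `visualize_channel(model, model.conv2, channel=3)` for the model sketched above.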
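For the attribution bullet, the simplest starting point is gradient saliency: the gradient of the target logit with respect to the input pixels. `saliency` is again just an illustrative name; more robust methods (e.g. integrated gradients, occlusion) follow the same pattern.

```python
def saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """Per-pixel attribution |d logit_target / d pixel| for one MNIST image x of shape [1, 1, 28, 28]."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad[0, 0].abs()
```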
OR
Find the induction heads in GPT-2. Do this exercise if you want to prioritize transformer interpretability.
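A rough sketch of one common way to look for induction heads, assuming the TransformerLens library; the sequence length, token range, and score threshold are arbitrary illustrative choices. The idea: repeat a random token sequence twice and score each head by how much attention it pays from each token back to the token just after that token's earlier occurrence.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len, batch = 50, 4

# A random sequence repeated twice: induction heads should attend from each token
# in the second copy back to the token *after* its first occurrence.
rand = torch.randint(1000, 10000, (batch, seq_len))
bos = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long)
tokens = torch.cat([bos, rand, rand], dim=1).to(model.cfg.device)

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, query_pos, key_pos]
    # Attention from position q back to position q - (seq_len - 1) lies on this diagonal.
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores[layer] = diag.mean(dim=(0, -1))

for layer, head in (scores > 0.4).nonzero().tolist():
    print(f"candidate induction head: layer {layer}, head {head}, score {scores[layer, head]:.2f}")
```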