Chapter 8: Interpretability

Mechanistic interpretability aims to reverse-engineer the weights of neural networks into human-understandable programs.

Recommended reading

  • Zoom In - An introduction to the Circuits thread, a deep dive into an ImageNet classifier known as InceptionV1. This introduction provides some high-level motivation and examples.

Optional reading

  • Feature Visualization - A method of interpreting a neuron by optimizing an input image based on the neuron's output.
  • Building Blocks - Explores interfaces that combine feature visualization with attribution, which tries to identify which inputs or earlier neurons are responsible for a given output or later activation.
  • Multimodal Neurons - An exploration of CLIP neurons, which correspond to a remarkable array of concepts.
  • A Mathematical Framework for Transformer Circuits - A preliminary framework for transformer interpretability, which helps give an intuition for what computations transformers are actually performing. There are accompanying exercises and videos.
  • Induction Heads - A study of induction heads, a mechanism for copying text that appears early in language model training.
  • SoLU - An activation function that improves language model interpretability. Includes some interesting examples of language model neuron interpretations.
  • Concrete Steps to Get Started in Transformer Mechanistic Interpretability - A much more detailed guide to getting into interpretability for transformers.
  • Chris Olah's views on AGI safety - Some high-level thoughts on how interpretability could help ensure the safety of advanced AI systems.

Suggested exercise

See also: 200 Concrete Open Problems in Mechanistic Interpretability

Train a small CNN to do MNIST classification, and try your best to mechanistically understand how it is able to classify digits correctly.

  • Try feature visualization, but note that it often doesn't work for models trained on such a simple dataset.
  • Try attribution, which is more likely to help (a starter sketch for both follows this list).
  • Use whatever other techniques you like. Consider focusing on a small part of the network and trying to understand it as best you can.
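As a starting point, here is a minimal sketch, assuming PyTorch and torchvision are available: it trains a small CNN on MNIST, computes a gradient saliency map (one basic form of attribution), and optimizes an input image to excite a single convolutional channel (a bare-bones feature visualization). The SmallCNN architecture and the saliency and visualize_channel helpers are illustrative choices, not part of the exercise itself.

```python
# Minimal sketch: small MNIST CNN + gradient saliency + basic feature visualization.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, 3, padding=1)   # few channels, easier to inspect
        self.conv2 = nn.Conv2d(8, 16, 3, padding=1)
        self.fc = nn.Linear(16 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 14x14 -> 7x7
        return self.fc(x.flatten(1))

def train(model, loader, epochs=1):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()

def saliency(model, x, target_class):
    """Gradient of the target logit w.r.t. the input: a crude attribution map."""
    model.eval()
    x = x.clone().requires_grad_(True)
    model(x)[0, target_class].backward()
    return x.grad[0, 0].abs()                        # 28x28 map of pixel importance

def visualize_channel(model, layer, channel, steps=200, lr=0.1):
    """Feature visualization: gradient-ascend the mean activation of one conv channel."""
    img = torch.randn(1, 1, 28, 28, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    for _ in range(steps):
        opt.zero_grad()
        model(img)
        (-acts["out"][0, channel].mean()).backward()
        opt.step()
    handle.remove()
    return img.detach()[0, 0]

if __name__ == "__main__":
    train_set = datasets.MNIST("data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    model = SmallCNN()
    train(model, loader)
    x, y = train_set[0]
    print("saliency map for digit", y, saliency(model, x.unsqueeze(0), y).shape)
    print("feature visualization for conv1 channel 0:",
          visualize_channel(model, model.conv1, 0).shape)
```

From here, you might look at which channels respond to strokes, loops, or particular digits, and at how the final linear layer reads those channels off to produce each class logit.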

OR

Find the induction heads in GPT-2. Do this exercise if you want to prioritize transformer interpretability.
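Here is a minimal detection sketch, assuming the Hugging Face transformers library. It uses the standard trick of feeding GPT-2 a repeated sequence of random tokens: an induction head at position t should attend back to position t - T + 1, the token that followed the previous occurrence of the current token. The induction score below (mean attention weight on that diagonal) is an illustrative heuristic rather than the exact metric from the paper.

```python
# Minimal sketch: score GPT-2 heads for induction-like attention on repeated random tokens.
import torch
from transformers import GPT2LMHeadModel

torch.manual_seed(0)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

seq_len = 50
prefix = torch.randint(0, model.config.vocab_size, (1, seq_len))
tokens = torch.cat([prefix, prefix], dim=1)          # [A_1..A_T, A_1..A_T]

with torch.no_grad():
    out = model(tokens, output_attentions=True)      # tuple of [1, heads, 2T, 2T] per layer

scores = []
for layer, attn in enumerate(out.attentions):
    # Attention from destination t to source t - (seq_len - 1): for t >= seq_len this is
    # the token right after the earlier copy of the current token.
    diag = torch.diagonal(attn[0], offset=-(seq_len - 1), dim1=-2, dim2=-1)
    head_scores = diag[:, 1:].mean(dim=-1)           # skip t = seq_len - 1 (no repeat yet)
    for head, s in enumerate(head_scores):
        scores.append((s.item(), layer, head))

for score, layer, head in sorted(scores, reverse=True)[:10]:
    print(f"layer {layer} head {head}: induction score {score:.3f}")
```

Heads whose scores stand well above the rest are candidates for induction heads; you could confirm them by ablating them and checking that the model's ability to copy repeated text degrades.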