Interpretability
In one line
You try to reverse-engineer what’s actually happening inside a neural network — to turn the black box into something we can read. Part science, part detective work, very addictive.
What it actually is
Interpretability (often “interp,” and “mechanistic interpretability” / “mech interp” for the bottom-up flavor) is the effort to understand models internally — not just what they output, but what computations and representations produce those outputs. Think of it as neuroscience for artificial networks: probing activations, finding circuits, identifying features, building tools to see inside.
I’m a little biased — it’s close to my own work — but I think it’s the most beginner-friendly research track right now. It’s young enough that a motivated self-taught person can read the key papers in a few months and reach something like the frontier, and the community rewards good public work regardless of where you came from. A single strong interp project has launched a lot of careers.
What you actually do day to day
- Form a hypothesis about how a model does something (“I bet there’s a circuit that tracks quotation marks”).
- Run experiments on the model’s internals — activation patching, probing, ablations.
- Build or use tooling to visualize what you find.
- Write it up clearly, often as a public post, because legibility is the whole point.
- A lot of “huh, that’s weird” followed by chasing the weird thing.
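To make "ablations" and "activation patching" concrete, here is a minimal sketch on a toy network. Everything in it is an illustrative assumption rather than a real model or library API: the hand-picked weights `W1` and `W2`, the `forward` helper, and its `overrides` argument are all made up for this example. In practice you would do the same thing with PyTorch hooks on a real transformer. The idea stays the same: cache activations from a "clean" run, then either zero a unit out (ablation) or splice the clean value into a "corrupted" run (patching) and watch how the output moves.

```python
# Toy two-layer ReLU network with hand-picked weights (hypothetical, for illustration).
W1 = [
    [1.0, -1.0],  # hidden unit 0 computes relu(x0 - x1)
    [0.5, 0.5],   # hidden unit 1 computes relu(mean of the inputs)
]
W2 = [2.0, 1.0]   # output = 2 * h0 + 1 * h1


def forward(x, overrides=None):
    """Run the toy network, optionally forcing some hidden activations.

    `overrides` maps a hidden-unit index to the value to write at that unit.
    Returns (output, hidden_activations).
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    for idx, value in (overrides or {}).items():
        hidden[idx] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


clean_x = [3.0, 1.0]    # input where the behavior of interest shows up
corrupt_x = [1.0, 3.0]  # minimally different input where it does not

# Baselines: run both inputs and cache the clean hidden activations.
clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Ablation: zero out hidden unit 0 on the clean input and see what breaks.
ablated_out, _ = forward(clean_x, overrides={0: 0.0})

# Activation patching: run the corrupted input, but splice in one clean
# activation at a time, recording the output for each patched unit.
patched_out = {
    i: forward(corrupt_x, overrides={i: clean_hidden[i]})[0]
    for i in range(len(W1))
}

print("clean:", clean_out, "corrupt:", corrupt_out)
print("ablate unit 0:", ablated_out)
print("patched:", patched_out)
```

If patching a single unit moves the corrupted output most of the way back to the clean output, that unit is a good candidate for carrying the relevant information. Here the weights were chosen so that one unit does all the work; real models are rarely that tidy, which is where the "huh, that's weird" part comes in.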
What you have to do to get in
The path
The cleanest on-ramp of any research track: do one good interp project and publish it. Work through ARENA’s interpretability material (public), pick a small question, and actually answer it. Programs that take interp people: MATS (several interp mentors), Anthropic Fellows (mech interp is an explicit area), SPAR.
Skills required
- Solid engineering: interp is empirical, so you live in PyTorch and notebooks. (Python and PyTorch)
- Transformer internals, deeply: you can’t reverse-engineer what you can’t first build. (Deep learning and transformers)
- A puzzle temperament: comfort sitting with confusion, chasing anomalies, not needing tidy answers fast.
- Clear writing & visualization: interp results that nobody can follow don’t count. (Career and Communication Skills)
Is this you?
Signs you lean interp
- You loved the part of any subject where you got to look under the hood.
- “Why did it do that?” is your favorite question.
- You like making things legible — diagrams, explanations, “here’s what’s really going on.”
- You’re okay with research that’s genuinely open-ended.
Pointers & extra resources
- Learn Mechanistic Interpretability — a structured path in.
- ARENA’s interpretability section — hands-on, public, the standard starting point.
- Neel Nanda’s “barriers to entry are low” framing and his [200 Concrete Open Problems in Mechanistic Interpretability] writeups are a goldmine; see Reading and Courses.
Related
Research Scientist · Research Engineer · Alignment and AI Safety · Skills Map · Tracks Overview