Interpretability

In one line

You try to reverse-engineer what’s actually happening inside a neural network — to turn the black box into something we can read. Part science, part detective work, very addictive.

What it actually is

Interpretability (often “interp”; the bottom-up flavor is “mechanistic interpretability,” or “mech interp”) is the effort to understand models from the inside: not just what they output, but which computations and representations produce those outputs. Think of it as neuroscience for artificial networks: probing activations, finding circuits, identifying features, building tools to see inside.

I’m a little biased — it’s close to my own work — but I think it’s the most beginner-friendly research track right now. It’s young enough that a motivated self-taught person can read the key papers in a few months and reach something like the frontier, and the community rewards good public work regardless of where you came from. A single strong interp project has launched a lot of careers.

What you actually do day to day

  • Form a hypothesis about how a model does something (“I bet there’s a circuit that tracks quotation marks”).
  • Run experiments on the model’s internals — activation patching, probing, ablations.
  • Build or use tooling to visualize what you find.
  • Write it up clearly, often as a public post, because legibility is the whole point.
  • A lot of “huh, that’s weird” followed by chasing the weird thing.
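
To make "activation patching" and "ablations" a bit more concrete, here is a minimal sketch on a made-up toy network. Everything in it is an assumption for illustration: the weights are hand-picked, the `forward` helper and its `patch` argument are hypothetical, and real work would use an actual trained model and its tooling rather than a four-unit NumPy network. It doesn't cover probing. The idea is just: run the model on a "clean" and a "corrupted" input, then swap one hidden activation at a time to see which unit carries the behavior.

```python
import numpy as np

# Hypothetical toy model: 3 inputs -> 4 hidden units (ReLU) -> 1 output.
# Weights are hand-picked for illustration, not taken from any real network.
W1 = np.array([
    [1.0, 0.0, 0.0],   # hidden unit 0 reads input 0
    [0.0, 1.0, 0.0],   # hidden unit 1 reads input 1
    [0.0, 0.0, 1.0],   # hidden unit 2 reads input 2
    [0.5, 0.5, 0.0],   # hidden unit 3 mixes inputs 0 and 1
])
W2 = np.array([2.0, 0.0, 0.0, 1.0])  # the output only uses units 0 and 3


def forward(x, patch=None):
    """Run the toy model; `patch` maps hidden-unit index -> value to overwrite."""
    hidden = np.maximum(W1 @ x, 0.0)
    if patch:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    return float(W2 @ hidden), hidden


clean = np.array([1.0, 0.0, 0.0])      # input where the behavior shows up
corrupted = np.array([0.0, 0.0, 1.0])  # input where it does not

clean_out, clean_hidden = forward(clean)
corrupted_out, _ = forward(corrupted)

# Activation patching: copy one clean activation into the corrupted run and
# measure what fraction of the clean-vs-corrupted output gap it restores.
recovered = {}
for i in range(len(clean_hidden)):
    patched_out, _ = forward(corrupted, patch={i: clean_hidden[i]})
    recovered[i] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

# Ablation: zero one unit in the clean run and measure how far the output drops.
ablation_effect = {}
for i in range(len(clean_hidden)):
    ablated_out, _ = forward(clean, patch={i: 0.0})
    ablation_effect[i] = clean_out - ablated_out

print("fraction of gap recovered by patching:", recovered)
print("output drop from ablating:", ablation_effect)
```

In this toy setup, patching unit 0 restores most of the gap and unit 3 restores the rest, while units 1 and 2 do nothing, which matches how the weights were chosen. On a real model the same loop is where the "huh, that's weird" moments tend to come from.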

What you have to do to get in

The path

The cleanest on-ramp of any research track: do one good interp project and publish it. Work through ARENA’s publicly available interpretability material, pick a small question, and actually answer it. Programs that take interp people include MATS (several interp mentors), Anthropic Fellows (mech interp is an explicit area), and SPAR.

Skills required

Is this you?

Signs you lean interp

  • You loved the part of any subject where you got to look under the hood.
  • “Why did it do that?” is your favorite question.
  • You like making things legible — diagrams, explanations, “here’s what’s really going on.”
  • You’re okay with research that’s genuinely open-ended.

Pointers & extra resources

Research Scientist · Research Engineer · Alignment and AI Safety · Skills Map · Tracks Overview