Interpretability

In one line

You try to reverse-engineer what’s actually happening inside a neural network — to turn the black box into something we can read. Part science, part detective work, very addictive.

What it actually is

Interpretability (often “interp,” and “mechanistic interpretability” / “mech interp” for the bottom-up flavor) is the effort to understand models internally — not just what they output, but what computations and representations produce those outputs. Think of it as neuroscience for artificial networks: probing activations, finding circuits, identifying features, building tools to see inside.

I’m a little biased — it’s close to my own work — but I think it’s the most beginner-friendly research track right now. It’s young enough that a motivated self-taught person can read the key papers in a few months and reach something like the frontier, and the community rewards good public work regardless of where you came from. A single strong interp project has launched a lot of careers.

What you actually do day to day

Form a hypothesis about how a model does something (“I bet there’s a circuit that tracks quotation marks”).
Run experiments on the model’s internals — activation patching, probing, ablations.
Build or use tooling to visualize what you find.
Write it up clearly, often as a public post, because legibility is the whole point.
A lot of “huh, that’s weird” followed by chasing the weird thing.
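
To make "activation patching" and "ablations" concrete, here is a minimal sketch on a toy two-layer network. Everything in it is an assumption for illustration: the tiny `nn.Sequential` model, the random "clean" and "corrupted" inputs, and the helper names (`run`, `patch_unit`, `ablate_unit`) are hypothetical stand-ins, not any particular paper's or library's method. Real projects do the same thing on transformer components, usually with dedicated tooling.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network (hypothetical, for illustration only).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
hidden_layer = model[1]  # we read and overwrite the post-ReLU activations

clean_input = torch.randn(1, 4)
corrupted_input = torch.randn(1, 4)


def run(x, hook=None):
    """Run the model, optionally with a temporary forward hook on hidden_layer."""
    handle = hidden_layer.register_forward_hook(hook) if hook is not None else None
    try:
        with torch.no_grad():
            return model(x)
    finally:
        if handle is not None:
            handle.remove()


# Step 1: cache the hidden activations from the clean run.
cache = {}


def save_hook(module, inputs, output):
    cache["hidden"] = output.clone()  # returning None leaves the output unchanged


clean_out = run(clean_input, save_hook)
corrupted_out = run(corrupted_input)


# Step 2: activation patching. Restore one hidden unit to its clean value
# during the corrupted run and see how far output logit 0 moves.
def patch_unit(unit):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, unit] = cache["hidden"][:, unit]
        return patched  # a returned tensor replaces the layer's output
    return hook


patch_effects = torch.stack(
    [run(corrupted_input, patch_unit(u))[0, 0] - corrupted_out[0, 0] for u in range(8)]
)


# Step 3: zero-ablation. Knock out one hidden unit during the clean run
# and measure how much output logit 0 drops.
def ablate_unit(unit):
    def hook(module, inputs, output):
        ablated = output.clone()
        ablated[:, unit] = 0.0
        return ablated
    return hook


ablation_effects = torch.stack(
    [clean_out[0, 0] - run(clean_input, ablate_unit(u))[0, 0] for u in range(8)]
)

print("patching effect per unit:", patch_effects)
print("ablation effect per unit:", ablation_effects)
```

Units with large patching or ablation effects are the ones worth a closer look. In this toy the last layer is linear, so the per-unit effects add up exactly to the total difference between runs; in a real model they generally will not, which is part of what makes the detective work interesting.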

What you have to do to get in

The path

The cleanest on-ramp of any research track: do one good interp project and publish it. Work through ARENA’s interpretability material (public), pick a small question, and actually answer it. Programs that take interp people: MATS (several interp mentors), Anthropic Fellows (mech interp is an explicit area), SPAR.

Skills required

Solid engineering: interp is empirical, so you live in PyTorch and notebooks. (Python and PyTorch)
Transformer internals, deeply: you can’t reverse-engineer what you can’t first build. (Deep learning and transformers)
A puzzle temperament: comfort sitting with confusion, chasing anomalies, not needing tidy answers fast.
Clear writing & visualization: interp results that nobody can follow don’t count. (Career and Communication Skills)
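
As a taste of what "build it first" means in practice, here is a rough sketch of a single causal self-attention head in PyTorch. It assumes one head, no batch dimension, and random weights; the function name `causal_attention_head` and the dimensions are made up for illustration. A real transformer adds multiple heads, an output projection, residual connections, and more.

```python
import math
import torch

torch.manual_seed(0)


def causal_attention_head(x, w_q, w_k, w_v):
    """One attention head. x: (seq_len, d_model); each weight: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(q.shape[-1])
    # Causal mask: position i may only attend to positions <= i.
    future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    pattern = torch.softmax(scores, dim=-1)  # (seq_len, seq_len) attention pattern
    return pattern @ v, pattern


seq_len, d_model, d_head = 5, 16, 4
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))

out, pattern = causal_attention_head(x, w_q, w_k, w_v)
print(pattern)  # each row sums to 1; entries above the diagonal are 0
```

The `pattern` matrix is the thing interp people stare at: row i shows which earlier positions token i is reading from.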

Is this you?

Signs you lean interp

You loved the part of any subject where you got to look under the hood.

“Why did it do that?” is your favorite question.

You like making things legible — diagrams, explanations, “here’s what’s really going on.”

You’re okay with research that’s genuinely open-ended.

Pointers & extra resources

Learn Mechanistic Interpretability — a structured path in.
ARENA’s interpretability section — hands-on, public, the standard starting point.
Neel Nanda’s “barriers to entry are low” framing and his “200 Concrete Open Problems in Mechanistic Interpretability” writeups are a goldmine; see Reading and Courses.

Research Scientist · Research Engineer · Alignment and AI Safety · Skills Map · Tracks Overview

A Field Guide to AI Fellowships
