Alignment and AI Safety

In one line

You work on making powerful AI systems reliably do what people intend — and not do catastrophic things — even as they get more capable. This is my home turf, so expect me to have opinions.

What it actually is

“AI safety” is less a single track than a mission that cuts across the others. People do safety work as scientists, as engineers, as interp researchers, as policy people. What unites them is the question: as AI systems become more capable, how do we keep them controllable, honest, and beneficial?

Concretely, the technical sub-areas you’ll hear about (these are roughly the areas the big fellowships name):

  • Scalable oversight — how do we supervise systems that are smarter than us at the task?
  • AI control — how do we get useful work out of models we don’t fully trust, safely?
  • Adversarial robustness — making models hard to jailbreak or manipulate.
  • Model organisms (of misalignment) — deliberately building examples of the failure modes we worry about, so we can study them.
  • Mechanistic interpretability — seeing inside, covered in its own note.
  • Evaluations — measuring dangerous capabilities and propensities.
  • AI security and model welfare — newer areas, both increasingly funded.

What you actually do day to day

Depends which sub-area, but typically: pick a concrete safety question, design an experiment or build a system that tests it, run it, and publish so others can build on it. It looks a lot like research engineering or research science — the difference is what you point it at.

What you have to do to get in

The path

The safety ecosystem has the best beginner on-ramp of any part of AI, by design — the field is young and actively building its own talent pipeline. A realistic ladder:

  1. Learn the ideas: BlueDot AI Safety Fundamentals (free, no technical background needed).
  2. Build the skills: ARENA (engineering) or self-study.
  3. Do supervised research: SPAR, MATS, LASR Labs, Pivotal, ERA.
  4. Go pro: Anthropic Fellows, Astra, or a full role.

Skills required

It’s the same technical stack as Research Engineer / Research Scientist (see Skills Map), plus one thing that’s specific to safety:

  • Conceptual clarity about the problem. You should be able to explain why a given piece of work reduces risk, not just that it’s technically interesting. The field is full of work that’s clever but doesn’t connect to the threat model. Reviewers notice.

Is this you?

Signs you lean safety

  • You find the “what happens as these systems get much more capable” question genuinely gripping, not abstract.
  • You want your technical work to be about something with stakes.
  • You’re comfortable holding uncertainty — a lot of safety is reasoning under “we’re not sure yet.”

My honest take

The field needs careful, skeptical people more than it needs true believers. The thing that pulled me in wasn’t doom — it was realizing how much genuinely interesting, unsolved technical work sits between “these systems are powerful” and “we actually understand and can steer them.” Bring your skepticism; it’s an asset here, not a liability.

Pointers & extra resources

Interpretability · Research Scientist · Research Engineer · Policy and Governance · Tracks Overview