Key Papers

The specific stuff. Most “how to get into AI” guides stay vague, so here are the actual papers I’d hand someone, grouped so you can read in an order that builds. You do not need to read all of these, and you definitely don’t need to understand every equation. Read the ones that match the track pulling at you, and read them the smart way: skim first (see Reading papers).

How to use this list

Pick your track, read the 3–4 papers under it plus the three foundations, and write one paragraph on each in your own words. That paragraph is worth more than ten passive reads, and it’s the start of a portfolio. (See The portfolio mindset.)

About the links

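Most of the interpretability papers live on transformer-circuits.pub, and nearly everything else is on arXiv. The transformer-circuits.pub links go to the site’s front page, so find the paper there by title. If a link ever rots, just search the exact title: every paper here is famous enough to be the top result.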

Start here (foundations — everyone reads these)

Attention Is All You Need — Vaswani et al. (2017). The transformer. Everything downstream is built on this one. → https://arxiv.org/abs/1706.03762
Language Models are Few-Shot Learners — Brown et al. (2020). GPT-3; the paper that made scaling impossible to ignore. → https://arxiv.org/abs/2005.14165
On the Opportunities and Risks of Foundation Models — Bommasani et al. (2021). The big-picture map of the whole era, capabilities and risks together. → https://arxiv.org/abs/2108.07258

How models get aligned to humans (RLHF & friends)

For Research Engineer, Applied and Product ML, and Alignment and AI Safety.

Deep Reinforcement Learning from Human Preferences — Christiano et al. (2017). The seed of RLHF. → https://arxiv.org/abs/1706.03741
Training language models to follow instructions with human feedback — Ouyang et al. (2022). InstructGPT — RLHF at scale, the recipe behind every modern chat model. → https://arxiv.org/abs/2203.02155
Constitutional AI: Harmlessness from AI Feedback — Bai et al. (2022). Using AI feedback and a written “constitution” instead of only human labels. → https://arxiv.org/abs/2212.08073

Why safety is actually hard (the conceptual core)

For Alignment and AI Safety, and anyone who wants to sound like they understand the problem.

Concrete Problems in AI Safety — Amodei, Olah, et al. (2016). The field’s founding problem list. Still the single best on-ramp. → https://arxiv.org/abs/1606.06565
Unsolved Problems in ML Safety — Hendrycks et al. (2021). A more modern, concrete research agenda. → https://arxiv.org/abs/2109.13916
Risks from Learned Optimization in Advanced ML Systems — Hubinger et al. (2019). Where “mesa-optimization” / inner alignment comes from. Dense but foundational. → https://arxiv.org/abs/1906.01820

Scalable oversight & control (supervising systems near or above our level)

The deep end of Alignment and AI Safety.

AI Safety via Debate — Irving et al. (2018). → https://arxiv.org/abs/1805.00899
Supervising Strong Learners by Amplifying Weak Experts — Christiano et al. (2018). Iterated amplification. → https://arxiv.org/abs/1810.08575
Measuring Progress on Scalable Oversight for Large Language Models — Bowman et al. (2022). → https://arxiv.org/abs/2211.03540
Weak-to-Strong Generalization — Burns et al. (2023). Can a weak supervisor elicit a strong model’s full ability? → https://arxiv.org/abs/2312.09390
AI Control: Improving Safety Despite Intentional Subversion — Greenblatt et al. (2023). Safety even if the model is trying to subvert you — a big recent reframing. → https://arxiv.org/abs/2312.06942

Evals & failure modes (measuring danger)

For anyone interested in Alignment and AI Safety evals or Policy and Governance.

Discovering Language Model Behaviors with Model-Written Evaluations — Perez et al. (2022). → https://arxiv.org/abs/2212.09251
Model Evaluation for Extreme Risks — Shevlane et al. (2023). The dangerous-capability eval agenda; the cleanest bridge from technical work into policy. → https://arxiv.org/abs/2305.15324
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — Hubinger et al. (2024). Deception that survives safety training — unsettling and important. → https://arxiv.org/abs/2401.05566

Interpretability (looking inside the model)

The reading spine for Interpretability. Read these roughly in order — each builds on the last.

A Mathematical Framework for Transformer Circuits — Elhage et al. (2021). The foundation of mechanistic interp. → https://transformer-circuits.pub/
In-context Learning and Induction Heads — Olsson et al. (2022). A concrete, beautiful result about how models learn in-context. → https://arxiv.org/abs/2209.11895
Toy Models of Superposition — Elhage et al. (2022). Why single neurons mean many things. → https://transformer-circuits.pub/
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Bricken et al. (2023). Sparse autoencoders enter the chat. → https://transformer-circuits.pub/
Scaling Monosemanticity — Templeton et al. (2024). The same idea, scaled to a production model. → https://transformer-circuits.pub/
200 Concrete Open Problems in Mechanistic Interpretability — Neel Nanda (2022). Less a paper, more a project-idea vending machine. Pick one and start. → https://www.neelnanda.io/

Governance & policy

For Policy and Governance.

Frontier AI Regulation: Managing Emerging Risks to Public Safety — Anderljung et al. (2023). The reference point for the current regulation conversation. → https://arxiv.org/abs/2307.03718
Model Evaluation for Extreme Risks — Shevlane et al. (2023) (also listed above). The technical hook that most serious policy work hangs on. → https://arxiv.org/abs/2305.15324

If you only do one thing with this list

Don’t just read. Pick one paper, replicate the smallest result in it or write a clear explainer of it, and publish that. One concrete artifact built off a real paper beats a reading list a mile long. That’s the whole move. (The portfolio mindset)


A Field Guide to AI Fellowships
