Key Papers
The specific stuff. Most “how to get into AI” guides stay vague — here are the actual papers I’d hand someone, grouped so you can read in an order that builds. You do not need to read all of these, and you definitely don’t need to understand every equation. Read the ones that match the track pulling at you, and read them the smart way (skim first — Reading papers).
How to use this list
Pick your track, read the 3–4 papers under it plus the three foundations, and write one paragraph on each in your own words. That paragraph is worth more than ten passive reads, and it’s the start of a portfolio. (The portfolio mindset)
About the links
Most of the interpretability papers live on transformer-circuits.pub, and those links go to the site’s front page rather than to each paper; nearly everything else is on arXiv. If a link ever rots, just search the exact title. Every paper here is famous enough to be the top result.
Start here (foundations — everyone reads these)
- Attention Is All You Need — Vaswani et al. (2017). The transformer. Everything downstream is built on this one. → https://arxiv.org/abs/1706.03762
- Language Models are Few-Shot Learners — Brown et al. (2020). GPT-3; the paper that made scaling impossible to ignore. → https://arxiv.org/abs/2005.14165
- On the Opportunities and Risks of Foundation Models — Bommasani et al. (2021). The big-picture map of the whole era, capabilities and risks together. → https://arxiv.org/abs/2108.07258
How models get aligned to humans (RLHF & friends)
For Research Engineer, Applied and Product ML, Alignment and AI Safety.
- Deep Reinforcement Learning from Human Preferences — Christiano et al. (2017). The seed of RLHF. → https://arxiv.org/abs/1706.03741
- Training language models to follow instructions with human feedback — Ouyang et al. (2022). InstructGPT: RLHF at scale, the recipe most modern chat models build on. → https://arxiv.org/abs/2203.02155
- Constitutional AI: Harmlessness from AI Feedback — Bai et al. (2022). Using AI feedback and a written “constitution” instead of only human labels. → https://arxiv.org/abs/2212.08073
Why safety is actually hard (the conceptual core)
For Alignment and AI Safety, and anyone who wants to sound like they understand the problem.
- Concrete Problems in AI Safety — Amodei, Olah, et al. (2016). The field’s founding problem list. Still the single best on-ramp. → https://arxiv.org/abs/1606.06565
- Unsolved Problems in ML Safety — Hendrycks et al. (2021). A more modern, concrete research agenda. → https://arxiv.org/abs/2109.13916
- Risks from Learned Optimization in Advanced ML Systems — Hubinger et al. (2019). Where “mesa-optimization” / inner alignment comes from. Dense but foundational. → https://arxiv.org/abs/1906.01820
Scalable oversight & control (supervising systems near or above our level)
The deep end of Alignment and AI Safety.
- AI Safety via Debate — Irving et al. (2018). → https://arxiv.org/abs/1805.00899
- Supervising Strong Learners by Amplifying Weak Experts — Christiano et al. (2018). Iterated amplification. → https://arxiv.org/abs/1810.08575
- Measuring Progress on Scalable Oversight for Large Language Models — Bowman et al. (2022). → https://arxiv.org/abs/2211.03540
- Weak-to-Strong Generalization — Burns et al. (2023). Can a weak supervisor elicit a strong model’s full ability? → https://arxiv.org/abs/2312.09390
- AI Control: Improving Safety Despite Intentional Subversion — Greenblatt et al. (2023). Safety even if the model is trying to subvert you — a big recent reframing. → https://arxiv.org/abs/2312.06942
Evals & failure modes (measuring danger)
For anyone interested in Alignment and AI Safety evals or Policy and Governance.
- Discovering Language Model Behaviors with Model-Written Evaluations — Perez et al. (2022). → https://arxiv.org/abs/2212.09251
- Model Evaluation for Extreme Risks — Shevlane et al. (2023). The dangerous-capability eval agenda; the cleanest bridge from technical work into policy. → https://arxiv.org/abs/2305.15324
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — Hubinger et al. (2024). Deception that survives safety training — unsettling and important. → https://arxiv.org/abs/2401.05566
Interpretability (looking inside the model)
The reading spine for Interpretability. Read these roughly in order — each builds on the last.
- A Mathematical Framework for Transformer Circuits — Elhage et al. (2021). The foundation of mechanistic interp. → https://transformer-circuits.pub/
- In-context Learning and Induction Heads — Olsson et al. (2022). A concrete, beautiful result about how models learn in-context. → https://arxiv.org/abs/2209.11895
- Toy Models of Superposition — Elhage et al. (2022). Why single neurons mean many things. → https://transformer-circuits.pub/
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Bricken et al. (2023). Sparse autoencoders enter the chat. → https://transformer-circuits.pub/
- Scaling Monosemanticity — Templeton et al. (2024). The same idea, scaled to a production model. → https://transformer-circuits.pub/
- 200 Concrete Open Problems in Mechanistic Interpretability — Neel Nanda (2022). Less a paper, more a project-idea vending machine. Pick one and start. → https://www.neelnanda.io/
Governance & policy
- Frontier AI Regulation: Managing Emerging Risks to Public Safety — Anderljung et al. (2023). The reference point for the current regulation conversation. → https://arxiv.org/abs/2307.03718
- Model Evaluation for Extreme Risks — Shevlane et al. (2023) (also listed above). The technical hook that most serious policy work hangs on. → https://arxiv.org/abs/2305.15324
If you only do one thing with this list
Don’t just read. Pick one paper, replicate the smallest result in it or write a clear explainer of it, and publish that. One concrete artifact built off a real paper beats a reading list a mile long. That’s the whole move. (The portfolio mindset)