Publications

2026

Learning a generative meta-model of LLM activations. Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt. arXiv, 2026. [bib] [paper]

Language model circuits are sparse in the neuron basis. Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann. arXiv, 2026. [bib] [paper]

2025

Predictive concept decoders: training scalable end-to-end interpretability assistants. Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt. arXiv, 2025. [bib] [paper]

Training language models to explain their own computations. Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas. arXiv, 2025. [bib] [paper]

LLM layers immediately correct each other. Arjun Patrawala, Jiahai Feng, Erik Jones, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2025.[bib] [paper]

Establishing best practices for building rigorous agentic benchmarks. Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang. arXiv, 2025. [bib] [paper]

Understanding in-context learning of addition via activation subspaces. Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen. arXiv, 2025. [bib] [paper]

Uncovering gaps in how humans and LLMs interpret subjective language. Erik Jones, Arjun Patrawala, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2025. Spotlight presentation.[bib] [paper]

Eliciting language model behaviors with investigator agents. Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt. arXiv, 2025. [bib] [paper]

Iterative label refinement matters more than preference optimization under weak supervision. Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt. arXiv, 2025. [bib] [paper]

2024

LatentQA: teaching LLMs to decode activations into natural language. Alexander Pan, Lijie Chen, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Extractive structures learned in pretraining enable generalization on finetuned facts. Jiahai Feng, Stuart Russell, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

What do learning dynamics reveal about generalization in LLM reasoning? Katie Kang, Amrith Setlur, Dibya Ghosh, Jacob Steinhardt, Claire Tomlin, Sergey Levine, Aviral Kumar. arXiv, 2024. [bib] [paper]

VibeCheck: discover and quantify qualitative differences in large language models. Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez. arXiv, 2024. [bib] [paper]

Explaining datasets in words: statistical models with natural language parameters. Ruiqi Zhong, Heng Wang, Dan Klein, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Safety vs. performance: how multi-objective learning reduces barriers to market entry. Meena Jagadeesan, Michael I. Jordan, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Feedback loops with language models drive in-context reward hacking. Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt. International Conference on Machine Learning (ICML), 2024.[bib] [paper]

Do models explain themselves? Counterfactual simulatability of natural language explanations. Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown. International Conference on Machine Learning (ICML), 2024. Spotlight presentation. [bib] [paper]

Covert malicious finetuning: challenges in safeguarding LLM adaptation. Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Monitoring latent world states in language models with propositional probes. Jiahai Feng, Stuart Russell, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Adversaries can misuse combinations of safe models. Erik Jones, Anca Dragan, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Interpreting the second-order effects of neurons in CLIP. Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Overthinking the truth: understanding how language models process false demonstrations. Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2024. Spotlight presentation.[bib] [paper]

Protein language models are biased by unequal sequence sampling across the tree of life. Frances Ding, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Approaching human-level forecasting with language models. Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt. arXiv, 2024. [bib] [paper]

Describing differences in image sets with natural language. Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy. Computer Vision and Pattern Recognition (CVPR), 2024. Oral presentation.[bib] [paper]

How do language models bind entities in context? Jiahai Feng, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2024.[bib] [paper]

Interpreting CLIP's image representation via text-based decomposition. Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2024. Oral presentation.[bib] [paper]

2023

Learning equilibria in matching markets from bandit feedback. Meena Jagadeesan, Alexander Wei, Yixin Wang, Michael I. Jordan, Jacob Steinhardt. Journal of the ACM (JACM), 2023.[bib] [paper]

Mass-producing failures of multimodal systems with language models. Shengbang Tong, Erik Jones, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]

Jailbroken: how does LLM safety training fail? Alexander Wei, Nika Haghtalab, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2023. Oral presentation.[bib] [paper]

Supply-side equilibria in recommender systems. Meena Jagadeesan, Nikhil Garg, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]

Improved Bayes risk can yield reduced social welfare under competition. Meena Jagadeesan, Michael I. Jordan, Jacob Steinhardt, Nika Haghtalab. Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]

Are neurons actually collapsed? On the fine-grained structure in neural representations. Yongyi Yang, Jacob Steinhardt, Wei Hu. International Conference on Machine Learning (ICML), 2023. [bib] [paper]

Automatically auditing large language models via discrete optimization. Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt. International Conference on Machine Learning (ICML), 2023.[bib] [paper]

Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2023.[bib] [paper]

Progress measures for grokking via mechanistic interpretability. Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2023. Oral presentation. [bib] [paper]

Incentivizing high-quality content in online recommender systems. Xinyan Hu, Meena Jagadeesan, Michael I. Jordan, Jacob Steinhardt. arXiv, 2023. [bib] [paper]

Eliciting latent predictions from transformers with the tuned lens. Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt. arXiv, 2023. [bib] [paper]

Goal driven discovery of distributional differences via language descriptions. Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2023.[bib] [paper]

Discovering latent knowledge in language models without supervision. Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2023.[bib] [paper]

2022

Generalized resilience and robust statistics. Banghua Zhu, Jiantao Jiao, Jacob Steinhardt. Annals of Statistics, 2022.[bib] [paper] [talk]

How would the viewer feel? Estimating wellbeing from video scenarios. Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks. Advances in Neural Information Processing Systems (NeurIPS), 2022. [bib] [paper]

Capturing failures of large language models via human cognitive biases. Erik Jones, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2022.[bib] [paper]

A3D: studying pretrained representations with programmable datasets. Ye Wang, Norman Mu, Daniele Grandi, Nicolas Savva, Jacob Steinhardt. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022.[bib] [paper]

Forecasting future world events with neural networks. Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks. Advances in Neural Information Processing Systems (NeurIPS), 2022.[bib] [paper]

Auditing visualizations: transparency methods struggle to detect anomalous behavior. Jean-Stanislas Denain, Jacob Steinhardt. arXiv, 2022. [bib] [paper]

More than a toy: random matrix models predict how real-world neural representations generalize. Alexander Wei, Wei Hu, Jacob Steinhardt. International Conference on Machine Learning (ICML), 2022.[bib] [paper]

Predicting out-of-distribution error with the projection norm. Yaodong Yu, Zitong Yang, Alexander Wei, Yi Ma, Jacob Steinhardt. International Conference on Machine Learning (ICML), 2022.[bib] [paper]

Describing differences between text distributions with natural language. Ruiqi Zhong, Charlie Snell, Dan Klein, Jacob Steinhardt. International Conference on Machine Learning (ICML), 2022.[bib] [paper]

The effects of reward misspecification: mapping and mitigating misaligned models. Alexander Pan, Kush Bhatia, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2022.[bib] [paper]

PixMix: dreamlike pictures comprehensively improve safety measures. Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Dawn Song, Jacob Steinhardt. Computer Vision and Pattern Recognition (CVPR), 2022.[bib] [paper] [code]

Stronger data poisoning attacks break data sanitization defenses. Pang Wei Koh, Jacob Steinhardt, Percy Liang. Machine Learning, 2022.[bib] [paper] [code]

Scaling out-of-distribution detection for real-world settings. Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song. International Conference on Machine Learning (ICML), 2022.[bib] [paper] [data]

2021

The effect of model size on worst-group generalization. Alan Le Pham, Eunice Chan, Vikranth Srivatsa, Dhruba Ghosh, Yaoqing Yang, Yaodong Yu, Ruiqi Zhong, Joseph E. Gonzalez, Jacob Steinhardt. Advances in Neural Information Processing Systems Workshop (NeurIPS Workshop), 2021. [bib] [paper]

What would Jiminy Cricket do? Towards agents that behave morally. Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt. Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021. [bib] [paper] [code]

Unsolved problems in ML safety. Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt. arXiv, 2021. [bib] [paper] [blog post]

Grounding representation similarity with statistical testing. Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt. Advances in Neural Information Processing Systems (NeurIPS), 2021.[bib] [paper]

Constructing and adjusting estimates for household transmission of SARS-CoV-2 from prior studies, widespread-testing and contact-tracing data. Mihaela Curmei*, Andrew Ilyas*, Owain Evans, Jacob Steinhardt. International Journal of Epidemiology, 2021.[bib] [preprint] [journal]

Robust estimation via generalized quasi-gradients. Banghua Zhu, Jiantao Jiao, Jacob Steinhardt. Information and Inference: A Journal of the IMA, 2021.[bib] [preprint] [journal]

Measuring coding challenge competence with APPS. Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt. Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021.[bib] [paper] [data]

Are larger pretrained language models uniformly better? Comparing performance at the instance level. Ruiqi Zhong, Dhruba Ghosh, Dan Klein, Jacob Steinhardt. Findings of the Association for Computational Linguistics (Findings of ACL), 2021.[bib] [paper]

Agnostic learning with unknown utilities. Kush Bhatia, Peter L. Bartlett, Anca D. Dragan, Jacob Steinhardt. Innovations in Theoretical Computer Science (ITCS), 2021.[bib] [paper] [talk]

Measuring massive multitask language understanding. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2021.[bib] [paper] [data]

Aligning AI with shared human values. Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt. International Conference on Learning Representations (ICLR), 2021.[bib] [paper] [data]

The many faces of robustness: a critical analysis of out-of-distribution generalization. Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer. International Conference on Computer Vision (ICCV), 2021.[bib] [paper] [code]

Natural adversarial examples. Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn Song. Computer Vision and Pattern Recognition (CVPR), 2021.[bib] [paper] [data]

Approximating how single head attention learns. Charlie Snell, Ruiqi Zhong, Dan Klein, Jacob Steinhardt. arXiv, 2021. [bib] [paper]

Limitations of post-hoc feature alignment for robustness. Collin Burns, Jacob Steinhardt. Computer Vision and Pattern Recognition (CVPR), 2021.[bib] [paper]

Measuring mathematical problem solving with the MATH dataset. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021.[bib] [paper] [code]

Understanding generalization in adversarial training via the bias-variance decomposition. Yaodong Yu, Zitong Yang, Edgar Dobriban, Jacob Steinhardt, Yi Ma. arXiv, 2021. [bib] [paper]

2020

Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow, Percy Liang, Pushmeet Kohli. Advances in Neural Information Processing Systems (NeurIPS), 2020.[bib] [paper] [blog post] [CodaLab]

How data science can ease the COVID-19 pandemic. Nigam Shah, Jacob Steinhardt. TechStream, 2020.[bib] [paper]

Why robustness is key to deploying AI. Jacob Steinhardt, Helen Toner. TechStream, 2020.[bib] [paper]

Prevalence tracking mechanisms for SARS-CoV-2. Jacob Steinhardt, Andrew Ilyas. preprint, 2020.[bib] [paper]

Estimation of SARS-CoV-2 infection prevalence in Santa Clara County. Steve Yadlowsky, Nigam Shah, Jacob Steinhardt. preprint, 2020.[bib] [paper] [code] [blog post]

Identifying statistical bias in dataset replication. Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, Aleksander Madry. International Conference on Machine Learning (ICML), 2020.[bib] [paper] [code] [blog post]

Rethinking bias-variance trade-off for generalization of neural networks. Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, Yi Ma. International Conference on Machine Learning (ICML), 2020.[bib] [paper]

When does the Tukey median work? Banghua Zhu, Jiantao Jiao, Jacob Steinhardt. IEEE International Symposium on Information Theory (ISIT), 2020. [bib] [paper]

2019

Testing robustness against unforeseen adversaries. Daniel Kang, Yi Sun, Dan Hendrycks, Tom Brown, Jacob Steinhardt. arXiv, 2019. [bib] [paper] [reviews] [code]

Sever: a robust meta-algorithm for stochastic optimization. Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, Alistair Stewart. International Conference on Machine Learning (ICML), 2019.[bib] [paper] [code]

Troubling trends in machine learning scholarship. Zachary C. Lipton, Jacob Steinhardt. ACM Queue, 2019.[bib] [paper]

FrAngel: component-based synthesis with control structures. Kensen Shi, Jacob Steinhardt, Percy Liang. Principles of Programming Languages (POPL), 2019.[bib] [paper] [CodaLab]

2018

Better agnostic clustering via relaxed tensor norms. Pravesh Kothari, Jacob Steinhardt. Symposium on Theory of Computing (STOC), 2018.[bib] [paper]

Resilience: a criterion for learning in the presence of arbitrary outliers. Jacob Steinhardt, Moses Charikar, Gregory Valiant. Innovations in Theoretical Computer Science (ITCS), 2018.[bib] [paper] [talk] [slides]

Robust learning: information theory and algorithms. Jacob Steinhardt. Stanford University, 2018.[bib] [thesis]

The malicious use of artificial intelligence: forecasting, prevention, and mitigation. Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, others. arXiv, 2018. [bib] [paper]

Semidefinite relaxations for certifying robustness to adversarial examples. Aditi Raghunathan, Jacob Steinhardt, Percy Liang. Advances in Neural Information Processing Systems (NeurIPS), 2018.[bib] [paper]

Certified defenses against adversarial examples. Aditi Raghunathan, Jacob Steinhardt, Percy Liang. International Conference on Learning Representations (ICLR), 2018.[bib] [paper] [reviews] [CodaLab]

2017

Learning from untrusted data. Moses Charikar, Jacob Steinhardt, Gregory Valiant. Symposium on Theory of Computing (STOC), 2017.[bib] [paper] [talk] [slides] [poster]

Does robustness imply tractability? A lower bound for planted clique in the semi-random model. Jacob Steinhardt. arXiv, 2017. [bib] [paper]

Certified defenses for data poisoning attacks. Jacob Steinhardt, Pang Wei Koh, Percy Liang. Advances in Neural Information Processing Systems (NeurIPS), 2017.[bib] [paper] [code] [CodaLab]

2016

Memory, communication, and statistical queries. Jacob Steinhardt, Gregory Valiant, Stefan Wager. Conference on Learning Theory (COLT), 2016.[bib] [paper]

Avoiding imposters and delinquents: adversarial crowdsourcing and peer prediction. Jacob Steinhardt, Gregory Valiant, Moses Charikar. Advances in Neural Information Processing Systems (NeurIPS), 2016.[bib] [paper]

Concrete problems in AI safety. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané. arXiv, 2016. [bib] [paper]

Unsupervised risk estimation using only conditional independence structure. Jacob Steinhardt, Percy Liang. Advances in Neural Information Processing Systems (NeurIPS), 2016.[bib] [paper]

2015

Minimax rates for memory-bounded sparse linear regression. Jacob Steinhardt, John Duchi. Conference on Learning Theory (COLT), 2015.[bib] [paper] [talk] [slides] [poster]

Learning with relaxed supervision. Jacob Steinhardt, Percy Liang. Advances in Neural Information Processing Systems (NeurIPS), 2015.[bib] [paper] [poster] [CodaLab]

Reified context models. Jacob Steinhardt, Percy Liang. International Conference on Machine Learning (ICML), 2015.[bib] [paper] [talk] [slides] [poster] [CodaLab]

Learning fast-mixing models for structured prediction. Jacob Steinhardt, Percy Liang. International Conference on Machine Learning (ICML), 2015.[bib] [paper] [talk] [slides] [poster] [CodaLab]

Learning where to sample in structured prediction. Tianlin Shi, Jacob Steinhardt, Percy Liang. Artificial Intelligence and Statistics (AISTATS), 2015.[bib] [paper] [slides] [CodaLab]

2014

The statistics of streaming sparse regression. Jacob Steinhardt, Stefan Wager, Percy Liang. arXiv preprint arXiv:1412.4182, 2014. [bib] [paper]

Adaptivity and optimism: an improved exponentiated gradient algorithm. Jacob Steinhardt, Percy Liang. International Conference on Machine Learning (ICML), 2014.[bib] [paper] [slides] [poster]

Filtering with abstract particles. Jacob Steinhardt, Percy Liang. International Conference on Machine Learning (ICML), 2014.[bib] [paper] [supplemental material] [slides] [poster]

2012

Flexible martingale priors for deep hierarchies. Jacob Steinhardt, Zoubin Ghahramani. Artificial Intelligence and Statistics (AISTATS), 2012.[bib] [paper] [slides] [poster]