Publications
2025
- Iterative label refinement matters more than preference optimization under weak supervision. , , . arXiv, 2025. [bib] [paper]
2024
- LatentQA: teaching LLMs to decode activations into natural language. , , . arXiv, 2024. [bib] [paper]
- Extractive structures learned in pretraining enable generalization on finetuned facts. , , . arXiv, 2024. [bib] [paper]
- What do learning dynamics reveal about generalization in LLM reasoning? , , , , , , . arXiv, 2024. [bib] [paper]
- VibeCheck: discover and quantify qualitative differences in large language models. , , , , . arXiv, 2024. [bib] [paper]
- Explaining datasets in words: statistical models with natural language parameters. , , , . arXiv, 2024. [bib] [paper]
- Safety vs. performance: how multi-objective learning reduces barriers to market entry. , , . arXiv, 2024. [bib] [paper]
- Feedback loops with language models drive in-context reward hacking. , , , . International Conference on Machine Learning (ICML), 2024. [bib] [paper]
- Do models explain themselves? Counterfactual simulatability of natural language explanations. , , , , , , , . International Conference on Machine Learning (ICML), 2024. Spotlight presentation. [bib] [paper]
- Monitoring latent world states in language models with propositional probes. , , . arXiv, 2024. [bib] [paper]
- Adversaries can misuse combinations of safe models. , , . arXiv, 2024. [bib] [paper]
- Interpreting the second-order effects of neurons in CLIP. , , . arXiv, 2024. [bib] [paper]
- Overthinking the truth: understanding how language models process false demonstrations. , , . International Conference on Learning Representations (ICLR), 2024. Spotlight presentation. [bib] [paper]
- Protein language models are biased by unequal sequence sampling across the tree of life. , . arXiv, 2024. [bib] [paper]
- Approaching human-level forecasting with language models. , , , . arXiv, 2024. [bib] [paper]
- Describing differences in image sets with natural language. , , , , , , , . Computer Vision and Pattern Recognition (CVPR), 2024. Oral presentation. [bib] [paper]
- How do language models bind entities in context? , . International Conference on Learning Representations (ICLR), 2024. [bib] [paper]
- Interpreting CLIP's image representation via text-based decomposition. , , . International Conference on Learning Representations (ICLR), 2024. Oral presentation. [bib] [paper]
2023
- Learning equilibria in matching markets from bandit feedback. , , , , . Journal of the ACM (JACM), 2023. [bib] [paper]
- Mass-producing failures of multimodal systems with language models. , , . Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]
- Jailbroken: how does LLM safety training fail? , , . Advances in Neural Information Processing Systems (NeurIPS), 2023. Oral presentation. [bib] [paper]
- Supply-side equilibria in recommender systems. , , . Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]
- Improved Bayes risk can yield reduced social welfare under competition. , , , . Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]
- Are neurons actually collapsed? On the fine-grained structure in neural representations. , , . International Conference on Machine Learning (ICML), 2023. [bib] [paper]
- Automatically auditing large language models via discrete optimization. , , , . International Conference on Machine Learning (ICML), 2023. [bib] [paper]
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. , , , , . International Conference on Learning Representations (ICLR), 2023. [bib] [paper]
- Progress measures for grokking via mechanistic interpretability. , , , , . International Conference on Learning Representations (ICLR), 2023. Oral presentation. [bib] [paper]
- Incentivizing high-quality content in online recommender systems. , , , . arXiv, 2023. [bib] [paper]
- Eliciting latent predictions from transformers with the tuned lens. , , , , , , , . arXiv, 2023. [bib] [paper]
- Goal driven discovery of distributional differences via language descriptions. , , , , , . Advances in Neural Information Processing Systems (NeurIPS), 2023. [bib] [paper]
- Discovering latent knowledge in language models without supervision. , , , . International Conference on Learning Representations (ICLR), 2023. [bib] [paper]
2022
- Generalized resilience and robust statistics. , , . Annals of Statistics, 2022. [bib] [paper] [talk]
- How would the viewer feel? Estimating wellbeing from video scenarios. , , , , , , , , . Advances in Neural Information Processing Systems (NeurIPS), 2022. [bib] [paper]
- Capturing failures of large language models via human cognitive biases. , . Advances in Neural Information Processing Systems (NeurIPS), 2022. [bib] [paper]
- A3D: studying pretrained representations with programmable datasets. , , , , . IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022. [bib] [paper]
- Forecasting future world events with neural networks. , , , , , , , , , . Advances in Neural Information Processing Systems (NeurIPS), 2022. [bib] [paper]
- Auditing visualizations: transparency methods struggle to detect anomalous behavior. , . arXiv, 2022. [bib] [paper]
- More than a toy: random matrix models predict how real-world neural representations generalize. , , . International Conference on Machine Learning (ICML), 2022. [bib] [paper]
- Predicting out-of-distribution error with the projection norm. , , , , . International Conference on Machine Learning (ICML), 2022. [bib] [paper]
- Describing differences between text distributions with natural language. , , , . International Conference on Machine Learning (ICML), 2022. [bib] [paper]
- The effects of reward misspecification: mapping and mitigating misaligned models. , , . International Conference on Learning Representations (ICLR), 2022. [bib] [paper]
- PixMix: dreamlike pictures comprehensively improve safety measures. , , , , , . Computer Vision and Pattern Recognition (CVPR), 2022. [bib] [paper] [code]
- Stronger data poisoning attacks break data sanitization defenses. , , . Machine Learning, 2022. [bib] [paper] [code]
- Scaling out-of-distribution detection for real-world settings. , , , , , . International Conference on Machine Learning (ICML), 2022. [bib] [paper] [data]
2021
- The effect of model size on worst-group generalization. , , , , , , , , . Advances in Neural Information Processing Systems Workshop (NeurIPS Workshop), 2021. [bib] [paper]
- What would Jiminy Cricket do? Towards agents that behave morally. , , , , , , , , . Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021. [bib] [paper] [code]
- Unsolved problems in ML safety. , , , . arXiv, 2021. [bib] [paper] [blog post]
- Grounding representation similarity with statistical testing. , , . Advances in Neural Information Processing Systems (NeurIPS), 2021. [bib] [paper]
- Constructing and adjusting estimates for household transmission of SARS-CoV-2 from prior studies, widespread-testing and contact-tracing data. , , , . International Journal of Epidemiology, 2021. [bib] [preprint] [journal]
- Robust estimation via generalized quasi-gradients. , , . Information and Inference: A Journal of the IMA, 2021. [bib] [preprint] [journal]
- Measuring coding challenge competence with APPS. , , , , , , , , , , . Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021. [bib] [paper] [data]
- Are larger pretrained language models uniformly better? Comparing performance at the instance level. , , , . Findings of the Association for Computational Linguistics (Findings of ACL), 2021. [bib] [paper]
- Agnostic learning with unknown utilities. , , , . Innovations in Theoretical Computer Science (ITCS), 2021. [bib] [paper] [talk]
- Measuring massive multitask language understanding. , , , , , , . International Conference on Learning Representations (ICLR), 2021. [bib] [paper] [data]
- Aligning AI with shared human values. , , , , , , . International Conference on Learning Representations (ICLR), 2021. [bib] [paper] [data]
- The many faces of robustness: a critical analysis of out-of-distribution generalization. , , , , , , , , , , , , . International Conference on Computer Vision (ICCV), 2021. [bib] [paper] [code]
- Natural adversarial examples. , , , , . Computer Vision and Pattern Recognition (CVPR), 2021. [bib] [paper] [data]
- Approximating how single head attention learns. , , , . arXiv, 2021. [bib] [paper]
- Limitations of post-hoc feature alignment for robustness. , . Computer Vision and Pattern Recognition (CVPR), 2021. [bib] [paper]
- Measuring mathematical problem solving with the MATH dataset. , , , , , , , . Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021. [bib] [paper] [code]
- Understanding generalization in adversarial training via the bias-variance decomposition. , , , , . arXiv, 2021. [bib] [paper]
2020
- Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. , , , , , , , , , , . Advances in Neural Information Processing Systems (NeurIPS), 2020. [bib] [paper] [blog post] [CodaLab]
- How data science can ease the COVID-19 pandemic. , . TechStream, 2020. [bib] [paper]
- Why robustness is key to deploying AI. , . TechStream, 2020. [bib] [paper]
- Prevalence tracking mechanisms for SARS-CoV-2. , . preprint, 2020. [bib] [paper]
- Estimation of SARS-CoV-2 infection prevalence in Santa Clara County. , , . preprint, 2020. [bib] [paper] [code] [blog post]
- Identifying statistical bias in dataset replication. , , , , , . International Conference on Machine Learning (ICML), 2020. [bib] [paper] [code] [blog post]
- Rethinking bias-variance trade-off for generalization of neural networks. , , , , . International Conference on Machine Learning (ICML), 2020. [bib] [paper]
- When does the Tukey median work? , , . IEEE International Symposium on Information Theory (ISIT), 2020. [bib] [paper]
2019
- Testing robustness against unforeseen adversaries. , , , , . arXiv, 2019. [bib] [paper] [reviews] [code]
- Sever: a robust meta-algorithm for stochastic optimization. , , , , , . International Conference on Machine Learning (ICML), 2019. [bib] [paper] [code]
- Troubling trends in machine learning scholarship. , . ACM Queue, 2019. [bib] [paper]
- FrAngel: component-based synthesis with control structures. , , . Principles of Programming Languages (POPL), 2019. [bib] [paper] [CodaLab]
2018
- Better agnostic clustering via relaxed tensor norms. , . Symposium on Theory of Computing (STOC), 2018. [bib] [paper]
- Resilience: a criterion for learning in the presence of arbitrary outliers. , , . Innovations in Theoretical Computer Science (ITCS), 2018. [bib] [paper] [talk] [slides]
- Robust learning: information theory and algorithms. . Stanford University, 2018. [bib] [thesis]
- The malicious use of artificial intelligence: forecasting, prevention, and mitigation. , , , , , , , , , , . arXiv, 2018. [bib] [paper]
- Semidefinite relaxations for certifying robustness to adversarial examples. , , . Advances in Neural Information Processing Systems (NeurIPS), 2018. [bib] [paper]
- Certified defenses against adversarial examples. , , . International Conference on Learning Representations (ICLR), 2018. [bib] [paper] [reviews] [CodaLab]
2017
- Learning from untrusted data. , , . Symposium on Theory of Computing (STOC), 2017. [bib] [paper] [talk] [slides] [poster]
- Does robustness imply tractability? A lower bound for planted clique in the semi-random model. . arXiv, 2017. [bib] [paper]
- Certified defenses for data poisoning attacks. , , . Advances in Neural Information Processing Systems (NeurIPS), 2017. [bib] [paper] [code] [CodaLab]
2016
- Memory, communication, and statistical queries. , , . Conference on Learning Theory (COLT), 2016. [bib] [paper]
- Avoiding imposters and delinquents: adversarial crowdsourcing and peer prediction. , , . Advances in Neural Information Processing Systems (NeurIPS), 2016. [bib] [paper]
- Concrete problems in AI safety. , , , , , . arXiv, 2016. [bib] [paper]
- Unsupervised risk estimation using only conditional independence structure. , . Advances in Neural Information Processing Systems (NeurIPS), 2016. [bib] [paper]
2015
- Minimax rates for memory-bounded sparse linear regression. , . Conference on Learning Theory (COLT), 2015. [bib] [paper] [talk] [slides] [poster]
- Learning with relaxed supervision. , . Advances in Neural Information Processing Systems (NeurIPS), 2015. [bib] [paper] [poster] [CodaLab]
- Reified context models. , . International Conference on Machine Learning (ICML), 2015. [bib] [paper] [talk] [slides] [poster] [CodaLab]
- Learning fast-mixing models for structured prediction. , . International Conference on Machine Learning (ICML), 2015. [bib] [paper] [talk] [slides] [poster] [CodaLab]
- Learning where to sample in structured prediction. , , . Artificial Intelligence and Statistics (AISTATS), 2015. [bib] [paper] [slides] [CodaLab]
2014
- The statistics of streaming sparse regression. , , . arXiv preprint arXiv:1412.4182, 2014. [bib] [paper]
- Adaptivity and optimism: an improved exponentiated gradient algorithm. , . International Conference on Machine Learning (ICML), 2014. [bib] [paper] [slides] [poster]
- Filtering with abstract particles. , . International Conference on Machine Learning (ICML), 2014. [bib] [paper] [supplemental material] [slides] [poster]
2012
- Flexible martingale priors for deep hierarchies. , . Artificial Intelligence and Statistics (AISTATS), 2012. [bib] [paper] [slides] [poster]
2011
- Finite-time regional verification of stochastic nonlinear systems. , . Robotics: Science and Systems (RSS), 2011. [bib] [paper] [journal] [slides] [poster]
- Pathological properties of deep Bayesian hierarchies. , . NIPS Workshop on Bayesian Nonparametrics, 2011. [bib] [paper] [poster]
2010
- Permutations with ascending and descending blocks. . Electronic Journal of Combinatorics, 2010. [bib] [paper] [slides]
2009
- On coloring the odd-distance graph. . Electronic Journal of Combinatorics, 2009. [bib] [paper]
2007
- Cayley graphs formed by conjugate generating sets of S_n. . preprint, 2007. [bib] [paper]