In machine learning, we are obsessed with datasets and metrics: progress in areas as diverse as natural language understanding, object recognition, and reinforcement learning is tracked by numerical scores on agreed-upon benchmarks. Despite this, I think we focus too little on measurement—that is, on ways of extracting data from machine learning models that bears upon important hypotheses. This might sound paradoxical, since benchmarks are after all one way of measuring a model. However, benchmarks are a very narrow form of measurement, and I will argue below that trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning. I’ll argue this in three ways:
- Historically, more measurement has almost always been a great move, not only in science but also in engineering and policymaking.
- Philosophically, measurement has many good properties that bear upon important questions in ML.
- In my own research, just measuring something and seeing what happened has often been surprisingly fruitful.
Once I’ve sold you on measurement in general, I’ll apply it to a particular case: the level of optimization power that a machine learning model has. Optimization power is an intuitively useful concept, often used when discussing risks from AI systems, but as far as I know no one has tried to measure it or even say what it would mean to measure it in principle. We’ll explore how to do this and see that there are at least two distinct measurable aspects of optimization power that have different implications (one on how misaligned an agent’s actions might be, and the other on how quickly an agent’s capabilities might grow). This shows by example that even thinking about how to measure something can be a helpful clarifying exercise, and suggests future directions of research that I discuss at the end.
Measurement Is Great
Above I defined measurement as extracting data from [a system], such that the data bears upon an important hypothesis. Examples of this would be looking at things under a microscope, performing crystallography, collecting survey data, observing health outcomes in clinical trials, or measuring household income one year after running a philanthropic program. In machine learning, it would include measuring the accuracy of models on a dataset, measuring the variability of models across random seeds, visualizing neural network representations, and computing the influence of training data points on model predictions.
I think we should measure far more things than we do. In fact, when thinking about almost any empirical research question, one of my early thoughts is: “Can I in principle measure something which would tell me the answer to this question, or at least get me started?” For an engineering goal, often it’s enough to measure the extent to which the goal has not yet been met (we can then perform gradient descent on that measure). For a scientific or causal question, thinking about an in-principle (but potentially infeasible) measurement that would answer the question can help clarify what we actually want and guide the design of a viable experiment. I think asking yourself this question pretty much all the time would probably make you a stronger researcher.
Below I’ll argue in more detail why measurement is so great. I’ll start by showing that it’s historically been a great idea, then offer philosophical arguments in its favor, and finally give instances where it’s been helpful in my own research.
Throughout the history of science, measuring things (often even in undirected ways) has repeatedly proven useful. Looking under a microscope revealed red blood cells, spermatozoa, and micro-organisms. X-ray crystallography helped drive many of the developments in molecular biology, underscored by the following quote from Francis Crick (emphasis mine):
But then, as you know, at the atomic level x-ray crystallography has turned out to be extremely powerful in determining the three-dimensional structure of macromolecules. Especially now, when combined with methods for measuring the intensities automatically and analyzing the data with very fast computers. The list of techniques is not something static—and they’re getting faster all the time. We have a saying in the lab that the difficulty of a project goes from the Nobel-prize level to the master’s-thesis level in ten years!
Indeed, much of the rapid progress in molecular biology was driven by a symbiotic relationship between new discoveries and better measurement, both in crystallography and in the types of assays that could be run as we discovered new ways of manipulating biological structures.
Measurement has also been useful outside of science. In development economics, measuring the impact of foreign aid interventions has upended the entire aid ecosystem, revealing previously popular interventions to be nearly useless while promoting new ones such as anti-malarial bednets. CO2 measurement helped alert us to the dangers of climate change. GPS measurement, initially developed for military applications, is now used for navigation, time synchronization, mining, disaster relief, and atmospheric measurements.
A key property is that many important trends are measurable long before they are viscerally apparent. Indeed, sometimes what feels like a discrete shift is a culmination of sustained exponential growth. One recent example is COVID-19 cases. While for those in major U.S. cities, it feels like “everything happened” in a single week in March, this was actually the culmination of exponential spread that started in December and was relatively clearly measurable at least by late January, even with limited testing. Another example is that market externalities can often be measured long before they become a problem. For instance, when gas power was first introduced, it actually decreased pollution because it was cleaner than coal, which is what it mainly displaced. However, it was also cheaper, and so the supply of gas-powered energy expanded dramatically, leading to greater pollution in the long run. This eventual consequence would have been predictable by measuring the rapid proliferation of gas-powered devices, even during the period where pollution itself had decreased. I find this a compelling analogy when thinking about how to predict unintended consequences of AI.
Measurement has several valuable properties. The first is that considering how to measure something forces one to ground a discussion: seemingly meaningful concepts may be revealed as incoherent if there is no way to measure them even in principle, while intractable disagreements might either vanish or quickly resolve when turned into conflicting predictions about an empirically measurable outcome.
Within science, being able to measure more properties of a system creates more interlocking constraints that can error-check theories. These interlocking constraints are, I believe, an underappreciated prerequisite for building meaningful scientific theories in the first place. That is, a scientific theory is not just a way of making predictions about an outcome, but presents a self-consistent account of multiple interrelated phenomena. This is why, for instance, exceeding the speed of light or failing to conserve energy would require radically rethinking our theories of physics (this would not be so if science were only about prediction and not the interlocking constraints). Walter Gilbert, Nobel laureate in Chemistry, puts this well (emphasis again mine):
“The major problem really was that when you’re doing experiments in a domain that you do not understand at all, you have no guidance what the experiment should even look like. Experiments come in a number of categories. There are experiments which you can describe entirely, formulate completely so that an answer must emerge; the experiment will show you A or B; both forms of the result will be meaningful; and you understand the world well enough so that there are only those two outcomes. Now, there’s a large other class of experiments that do not have that property. In which you do not understand the world well enough to be able to restrict the answers to the experiment. So you do the experiment, and you stare at it and say, Now does it mean anything, or can it suggest something which I might be able to amplify in further experiment? What is the world really going to look like?”
“The messenger experiments had that property. We did not know what it should look like. And therefore as you do the experiments, you do not know what in the result is your artifact, and what is the phenomenon. There can be a long period in which there is experimentation that circles the topic. But finally you learn how to do experiments in a certain way: you discover ways of doing them that are reproducible, or at least—” He hesitated. “That’s actually a bad way of saying it—bad criterion, if it’s just reproducibility, because you can reproduce artifacts very very well. There’s a larger domain of experiments where the phenomena have to be reproducible and have to be interconnected. Over a large range of variation of parameters, so you believe you understand something.”
This quote was in relation to the discovery of mRNA. The first paragraph of this quote, incidentally, is why I think phrases such as “novel but not surprising” should never appear in a review of a scientific paper as a reason for rejection. When we don’t even have a well-conceptualized outcome space, almost all outcomes will be “unsurprising”, because we don’t have any theories to constrain our expectations. But this is exactly when it is most important to start measuring things!
Finally, measurement often swiftly resolves long-standing debates and misconceptions. We can already see this in machine learning. For distribution shift, many feared that highly accurate neural net models were overfitting their training distribution and would have low out-of-distribution accuracy; but once we measured it, we found that in- and out-of-distribution accuracy were actually strongly correlated. Many also feared that neural network representations were mostly random gibberish that happened to get the answer correct. While adversarial examples show that these representations do have issues, visualizing the representations shows clear semantic structure such as edge filters, at least ruling out the “gibberish” hypothesis.
In both of the preceding cases, measurement also led to a far more nuanced subsequent debate. For robustness to distribution shift, we now focus on what interventions beyond accuracy can improve robustness, and whether these interventions reliably generalize across different types of shift. For neural network representations, we now ask how different layers behave, and what the best way is to extract information from each layer.
This brings us to the success of measurement in some of my own work. I’ve already hinted at it in talking about machine learning robustness. Early robustness benchmarks (not mine) such as ImageNet-C and ImageNet-v2 revealed the strong correlation between in-distribution and out-of-distribution accuracy. They also raised several hypotheses about what might improve robustness (beyond just in-distribution accuracy), such as larger models, more diverse training data, certain types of data augmentation, or certain architecture choices. However, these datasets measured robustness only to two specific families of shifts. Our group then decided to measure robustness to pretty much everything we could think of, including abstract renditions (ImageNet-R), occlusion and viewpoint variation (DeepFashion Remixed), and country and year (StreetView Storefronts). We found that almost every existing hypothesis had to be at least qualified, and we currently have a more nuanced set of hypotheses centered around “texture bias”. Moreover, based on a survey of ourselves and several colleagues, no one predicted the full set of qualitative results ahead of time.
In this particular case I think it was important that many of the distribution shifts were relative to the same training distribution (ImageNet), and the rest were at least in the vision domain. This allowed us to measure multiple phenomena for the same basic object, although I think we’re not yet at Walter Gilbert’s “larger domain of interconnected phenomena”, so more remains to be done.
Another example concerned the folk theory that ML systems might be imbalanced savants: very good at one type of thing but poor at many other things. This was again a sort of conventional wisdom that floated around for a long time, supported by anecdotes, but never systematically measured. We did so by testing the few-shot performance of language models across 57 domains including elementary mathematics, US history, computer science, law, and more. What we found was that while ML models are indeed imbalanced (far better, for instance, at high school geography than high school physics), they are probably less imbalanced than humans on the same tasks, so “savant” is the wrong designation. (This is only one domain and doesn’t close the case; for instance, AlphaZero is more savant-like.) We incidentally found that models were also poorly calibrated, which may be an equally important issue to address.
Finally, to better understand the generalization of neural networks, we measured the bias-variance decomposition of the test error. This decomposes error into the variance (the error caused by random variation due to random initialization, choice of training data, etc.) and bias (the part of the error that is systematic across these random choices). This is in some sense a fairly rudimentary move as far as measurement goes (it replaces a single measurement with two measurements) but has been surprisingly fruitful. First, it helped explain strange “double descent” generalization curves that exhibit two peaks (they are the separate peaks in bias and variance). But it has also helped us to clarify other hypotheses. For instance, adversarially trained models tend to generalize poorly, and this is often explained via a two-dimensional conceptual example where the adversarial decision boundary is more crooked (and thus higher complexity) than the regular decision boundary. While this conceptual example correctly predicts the bulk generalization behavior, it doesn’t always correctly predict the bias and variance individually, suggesting there is more to be said than the current story.
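To make the decomposition concrete, here is a minimal sketch for squared error, assuming we can retrain the same model under many random seeds and evaluate each copy on a fixed test set. The “model” here (a shrunken sample mean, `train_and_predict`) and all of its settings are hypothetical stand-ins chosen only so the example runs quickly; they are not the setup used in the work described above.

```python
import random

def bias_variance(preds, targets):
    """Decompose mean squared error into bias^2 and variance.

    preds[s][i] is the prediction on test point i of the model trained
    with seed s; targets[i] is the noise-free target for test point i.
    """
    n_seeds, n_points = len(preds), len(targets)
    bias_sq = variance = total = 0.0
    for i in range(n_points):
        column = [preds[s][i] for s in range(n_seeds)]
        mean_pred = sum(column) / n_seeds
        # Systematic part: error of the prediction averaged over seeds.
        bias_sq += (mean_pred - targets[i]) ** 2
        # Random part: spread of individual predictions around that average.
        variance += sum((p - mean_pred) ** 2 for p in column) / n_seeds
        total += sum((p - targets[i]) ** 2 for p in column) / n_seeds
    return bias_sq / n_points, variance / n_points, total / n_points

def train_and_predict(seed, test_xs, n_train=20, shrink=0.5):
    """Hypothetical model: a shrunken sample mean fit to noisy labels."""
    rng = random.Random(seed)
    # The true function is the constant 1.0; labels carry Gaussian noise.
    labels = [1.0 + rng.gauss(0.0, 1.0) for _ in range(n_train)]
    estimate = shrink * sum(labels) / n_train  # shrinkage: more bias, less variance
    return [estimate for _ in test_xs]

test_xs = [0.0, 1.0, 2.0]
targets = [1.0, 1.0, 1.0]
preds = [train_and_predict(seed, test_xs) for seed in range(200)]
bias_sq, variance, total = bias_variance(preds, targets)
print(f"bias^2={bias_sq:.3f} variance={variance:.3f} total={total:.3f}")
```

For squared error the two terms add up to the total error exactly, which is what lets a single error curve be split into two separately interpretable curves.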
Optimization power is, roughly, the degree to which an agent can shape its actions or environment towards accomplishing some objective. This concept often appears when discussing risks from AI—for instance, a paperclip-maximizing AI with too much optimization power might convert the whole world to paperclips; or recommendation systems with too much optimization power might create clickbait or addict users to the feed; or a system trained to maximize reported human happiness might one day take control of the reporting mechanism, rather than actually making humans happy (or perhaps make us happy at the expense of other values that promote flourishing).
In all of these cases, the concern is that future AI systems will (and perhaps already do) have a lot of optimization power, and this could lead to bad outcomes when optimizing even a slightly misspecified objective. But what does “optimization power” actually mean? Intuitively, it is the ability of a system to pursue an objective, but beyond that the concept is vague. (For instance, does GPT-3 or the Facebook newsfeed have more optimization power? How would we tell?)
To remedy this, I will propose two ways of measuring optimization power. The first is about “outer optimization” power (that is, of SGD or whatever other process is optimizing the learned parameters of the agent), while the second is about “inner optimization” power (that is, the optimization performed by the agent itself). Think of outer optimization as analogous to evolution and inner optimization as analogous to the learning a human does during their lifetime (both will be defined later). Measuring outer optimization will primarily tell us about the reward hacking concerns discussed above, while inner optimization will provide information about “take-off speed” (how quickly AI capabilities might improve from subhuman to superhuman).
By outer optimization, we mean the process by which stochastic gradient descent (SGD) or some other algorithm shapes the learned parameters of an agent or system. The concern is that SGD, given enough computation, explores a vast space of possibilities and might find unintended corners of that possibility space.
To measure this, I propose the following. In many cases, a system’s objective function has some set of hyperparameters or other design choices (for instance, Facebook might have some numerical trade-off between likes, shares, and other metrics). Suppose we perturb these parameters slightly (e.g. change the stated trade-off between likes and shares), and then re-optimize for that new objective. We then measure: How much worse is the system according to the original objective?
I think this directly gets at the issue of hacking mis-specified rewards, because it measures how much worse you would do if you got the objective function slightly wrong. And, it is a concrete number that we can measure, at least in principle! If we collected this number for many important AI systems, and tracked how it changes over time, then we could see if reward hacking is getting worse over time and course correct if need be.
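As a toy illustration of the perturb-and-re-optimize measurement, the sketch below replaces training with a grid search over a single knob `x`. Everything in it is an assumption made for illustration: the metric functions `likes` and `shares`, the trade-off weight `w`, and the grid itself. In a real system, `optimize` would be a full training run.

```python
import math

GRID = [i / 1000 for i in range(1001)]  # candidate systems, indexed by one knob x

def likes(x):
    return math.sqrt(x)          # hypothetical metric, increasing in x

def shares(x):
    return math.sqrt(1.0 - x)    # hypothetical metric, decreasing in x

def objective(x, w):
    """Stated objective: a weighted trade-off between the two metrics."""
    return w * likes(x) + (1.0 - w) * shares(x)

def optimize(w):
    """Stand-in for (re-)training: pick the best x for trade-off weight w."""
    return max(GRID, key=lambda x: objective(x, w))

def regret(w_original, w_perturbed):
    """How much worse the re-optimized system scores on the original objective."""
    x_original = optimize(w_original)
    x_perturbed = optimize(w_perturbed)
    return objective(x_original, w_original) - objective(x_perturbed, w_original)

for eps in (0.0, 0.05, 0.2):
    print(f"perturbation={eps:.2f} regret={regret(0.5, 0.5 + eps):.4f}")
```

The number to track is the regret per unit of perturbation: a system whose regret grows quickly under small perturbations is one where a slightly wrong objective is costly.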
Challenges and Research Questions. There are some issues with this metric. First, it relies on the fairly arbitrary set of parameters that happen to appear in the objective function. These parameters may or may not capture the types of mis-specification that are likely to occur for the true reward function. We could potentially address this by including additional perturbations, but this is an additional design choice and it would be better to have some guidance on what perturbations to use, or a more intrinsic way of perturbing the objective.
Another issue is that in many important cases, re-optimizing the objective isn’t actually feasible. For instance, Facebook couldn’t just create an entire copy of Facebook to optimize. Even for agents that don’t interact with users, such as OpenAI’s DotA agent, a single run of training might be so expensive that doing it even twice is infeasible, let alone continuously re-optimizing.
Thus, while the re-optimization measurement in principle provides a way to measure reward hacking, we need some technical ideas to tractably simulate it. I think ideas from causal inference could be relevant: we could treat the perturbed objective as a causal intervention, and use something like propensity weighting to simulate its effects. Other ideas such as influence functions might also help. However, we would need to scale these techniques to the complex, non-parametric, non-convex models that are used in modern ML applications.
Finally, we turn to inner optimization. Inner optimization is the optimization that an agent itself performs during its own “lifetime”. It is important because an agent’s inner optimization can potentially be much faster than the outer optimization that shapes it. For instance, while evolution is an optimization process that shapes humans and other organisms, humans themselves also optimize the environment, and on a much faster time scale than evolution. Indeed, while evolution was the dominant force for much of Earth’s history, humans have been the dominant force in recent millennia. This underscores another important point: when inner optimization outpaces outer optimization, it can lead to a phase transition in what outcomes are pursued.
With all of this said, I think it’s an error to refer to humans (or ML agents) as “optimizers”. While humans definitely pursue goals, we rarely do so optimally, and most of our behavior is based on adapted reflexes and habits rather than actually taking the argmax of some function. I think it’s better to think of humans as “adapters” rather than “optimizers”, and so by inner optimization I will really mean inner adaptation.
To further clarify this, I think it helps to see how adapting behaviors arise in humans and other animals. Evolution changes organisms across generations to be adapted to their environment, and so it’s not obvious that organisms need to further adapt during their lifetime—we could imagine a simple insect that only performs reflex actions and never learns or adapts. However, when environments themselves can change over time, it pays for an organism to not only be adapted by evolution but also adaptable within its own lifetime. Planning and optimizing can be thought of as certain limiting types of adaptation machinery—they are general-purpose and so can be applied even in novel situations, but are often less efficient than special-purpose solutions.
Returning to ML, we can measure inner adaptation by looking at how much an agent changes over its own lifetime, relative to how much it changes across SGD steps. For example, we could take a language model trained on the web, then deploy it on a corpus of books (but not update its learned parameters). What does its average log-loss on the next book look like after having “read” 0, 1, 2, 5, 10, 50, etc. books already? The more this loss decreases as the model reads more books, the more we will say the agent performs inner adaptation.
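The measurement can be sketched with a toy stand-in for a language model. In this sketch, the frozen “learned parameters” are a fixed token prior (`WEB_PRIOR`), and the agent's only lifetime state is a table of token counts that it updates between documents. The vocabulary, both distributions, and the `prior_strength` setting are all hypothetical; a real experiment would replace `log_loss` with the model's actual loss given its context or memory.

```python
import math
import random

VOCAB = ["a", "b", "c", "d"]
WEB_PRIOR = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}  # frozen "parameters"
BOOK_DIST = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}  # deployment distribution

def sample_doc(rng, length=50):
    """Draw one 'book' from the deployment distribution."""
    return rng.choices(VOCAB, weights=[BOOK_DIST[t] for t in VOCAB], k=length)

def log_loss(doc, counts, prior_strength=20.0):
    """Average per-token log-loss, mixing the frozen prior with lifetime counts."""
    seen = sum(counts.values())
    loss = 0.0
    for tok in doc:
        p = (prior_strength * WEB_PRIOR[tok] + counts[tok]) / (prior_strength + seen)
        loss -= math.log(p)
    return loss / len(doc)

def adaptation_curve(checkpoints, n_lifetimes=200, seed=0):
    """Mean loss on the next document after having read k documents."""
    rng = random.Random(seed)
    sums = {k: 0.0 for k in checkpoints}
    for _ in range(n_lifetimes):
        counts = {t: 0 for t in VOCAB}  # a fresh lifetime starts with no state
        for k in range(max(checkpoints) + 1):
            doc = sample_doc(rng)
            if k in sums:
                sums[k] += log_loss(doc, counts)  # scored before it is read
            for tok in doc:
                counts[tok] += 1  # lifetime state changes; the prior never does
    return {k: sums[k] / n_lifetimes for k in checkpoints}

curve = adaptation_curve([0, 1, 2, 5, 10])
for k, loss in curve.items():
    print(f"docs read={k:2d}  next-doc log-loss={loss:.3f}")
```

The gap between the loss at zero documents and the loss at k documents is one candidate number for “how much inner adaptation” the agent performs over a lifetime of length k.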
In practice, for ML systems today we probably want to feed the agent random sentences or paragraphs to adapt to, rather than entire books. Even for these shorter sequences, we know that state-of-the-art language models perform limited inner adaptation, because transformers have a fixed context length and so can only adapt to a small bounded number of previous inputs. Interestingly, previously abandoned architectures such as recurrent networks could in principle adapt to arbitrarily long sequences.
That being said, I think we should expect more inner optimization in the near future, as AI personal assistants become viable. These agents will have longer lifetimes and will need to adapt much more precisely to the individual people that they assist. This may lead to a return to recurrent or other stateful architectures, and the long lifetime and high intra-person and cross-time heterogeneity will incentivize inner adaptation.
The above discussion illustrates the value of measurement. Even defining the measurement (without yet taking it) clarifies our thinking, and thus leads to insight about transformers vs. RNNs and the possible consequences of AI personal assistants.
Research Questions. As with outer optimization, there are still free variables to consider in defining inner adaptation. For instance, how much is “a lot” of inner adaptation? Even existing models probably adapt more in their lifetime than the equivalent of 1 step of SGD, but that’s because a single SGD step isn’t much. Should we then compare to 1000 steps of SGD? 1 million? Or should we normalize against a different metric entirely?
There are also free variables in how we define the agent’s lifetime and what type of environment we ask it to adapt to. For instance, books or tweets? Random sentences, random paragraphs, or something else? We probably need more empirical work to understand which of these makes the most sense.
Take-off Speed. The measurement thought experiment is also useful for thinking about “take-off speeds”, or the rate at which AI capabilities progress from subhuman to superhuman. Many discussions focus on whether take-off will be slow/continuous or fast/discontinuous. The outer/inner distinction shows that it’s not really about continuous vs. discontinuous, but about two continuous processes (SGD and the agent’s own adaptation) that operate on very different time scales. However, we can in principle measure both, and similar to pollution externalities I expect we would be able to see inner adaptation increasing before it became the overwhelming dynamic.
Does that mean we shouldn’t be worried about fast take-offs? I think not, because the analogy with evolution suggests that inner adaptation could run much faster than SGD, and there might be a quick transition from really boring adaptation (reptiles) to something that’s a super-big deal (humans). So, we should want ways to very sensitively measure this quantity and worry if it starts to increase exponentially, even if it starts at a low level (just like with COVID-19!).
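One simple way to check for exponential growth while a quantity is still small is to fit a straight line to its logarithm over time and read off the doubling time. The sketch below does this with ordinary least squares; the function name `doubling_time` and the synthetic `measurements` series are illustrative assumptions, not real data.

```python
import math

def doubling_time(values):
    """Fit log(value) against the time index by least squares.

    Returns the estimated doubling time in time steps, or infinity if the
    fitted trend is flat or decreasing. All values must be positive.
    """
    n = len(values)
    times = list(range(n))
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    slope = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    slope /= sum((t - t_mean) ** 2 for t in times)
    return math.log(2) / slope if slope > 0 else math.inf

# Synthetic series: tiny in absolute terms, but doubling every 3 steps.
measurements = [0.001 * 2 ** (t / 3) for t in range(12)]
estimate = doubling_time(measurements)
print(f"estimated doubling time: {estimate:.2f} steps")
```

Because the fit is on the log scale, the absolute level of the quantity does not matter: a short doubling time is visible even when every individual measurement still looks negligible.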