In machine learning, we are obsessed with datasets and metrics: progress in areas as diverse as natural language understanding, object recognition, and reinforcement learning is tracked by numerical scores on agreed-upon benchmarks. Despite this, I think we focus too little on measurement—that is, on ways of extracting data from machine learning models that bears upon important hypotheses. This might sound paradoxical, since benchmarks are, after all, one way of measuring a model. However, benchmarks are a very narrow form of measurement, and I will argue below that trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning. I’ll argue this in three ways:
- Historically, more measurement has almost always been a great move, not only in science but also in engineering and policymaking.
- Philosophically, measurement has many good properties that bear upon important questions in ML.
- In my own research, just measuring something and seeing what happens has often been surprisingly fruitful.