Jacob Steinhardt

New Blog Location

2021-10-13T00:00:00-07:00

I’ve switched the host of my blog to Ghost, so my blog is now located at https://bounded-regret.ghost.io/.

Feedly users can subscribe via RSS here. Or click the bottom-right subscribe button on this page to get e-mail notifications of new posts.

How Much Do Recommender Systems Drive Polarization?

2021-07-26T00:00:00-07:00

Polarization caused by social media is seen by many as an important societal problem, which also overlaps with AI alignment (since social media recommendations come from ML algorithms). I have personally begun directing some of my research to recommender alignment, which has gotten me curious about the extent to which polarization is actually driven by social media. This blog post is the first in a series that summarizes my current take-aways. I’ll start (in this post) by looking at aggregate trends in polarization, then connect them with micro-level data on Facebook feeds in later posts.

I started out feeling that most polarization probably comes from social media. As I read more, my views have shifted: I think there’s pretty good evidence that other sources, including cable news, have historically driven a lot of polarization (see DellaVigna and Kaplan (2006) and Martin and Yurukoglu (2017)), and that we would be highly polarized even without social media. In addition, most readers of this post (myself included) are “extremely online”, and probably intuitively overestimate the impact of social media on a typical American. That said, it is possible that social media has further accelerated polarization to an important degree; the data are too noisy to provide strong evidence either way.

As a final caveat, I am considering polarization specifically, and ignoring other issues such as fake news, which could also be important. That being said, here are some key conclusions in more detail:

Social media seems unlikely to be the major direct cause of increased polarization. The main evidence is that polarization since 2000 has only increased in some Western countries, despite relatively uniform uptake of internet use across countries. Some additional weaker evidence is that polarization in the U.S. has increased steadily since 1980 (so pre-internet) and increased the most in the 65+ age group (which has the least social media usage).
However, there are two counterarguments to consider. The first is that traditional polarization measures might not make sense in multi-party systems (which include many European countries such as Germany and Italy), and that other correlates of polarization, such as the rise of populism, do seem more universal.
The second counterargument is that, while social media might not directly influence 65-year-olds much, there could be important indirect effects if social media changes the incentives of traditional media companies.
Consequently, it is possible that social media is an important accelerator of polarization, due to these incentive effects or for some other reason. I did not find strong evidence either for or against this, mainly due to lack of data.

Definitions

Researchers typically consider two types of polarization: affective polarization, which has to do with feelings toward the opposing party, and issue polarization, which has to do with concrete political stances. Affective polarization is measured by questions such as “How warmly do you feel towards…?” or “Would you feel okay with your child dating a {Democrat, Republican}?”. In contrast, issue polarization is higher if opinions on e.g. guns, abortion, and taxes are all perfectly correlated with each other, and lower if they are only somewhat correlated.
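One simple way to make the issue-polarization definition concrete is to average the absolute pairwise correlation between issue positions. The sketch below does this with made-up respondents; the function names, the -1/+1 coding, and the choice of mean absolute Pearson correlation are all illustrative assumptions, not the measure used in the papers cited in this post.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_abs_correlation(responses):
    """Average |r| over all pairs of issues.

    `responses` maps an issue name to a list of positions, one per respondent.
    """
    pairs = list(combinations(responses, 2))
    total = sum(abs(pearson(responses[a], responses[b])) for a, b in pairs)
    return total / len(pairs)


# HYPOTHETICAL respondents (not survey data), coded -1 (left) to +1 (right).
aligned = {
    "guns":     [-1, -1, 1, 1],
    "abortion": [-1, -1, 1, 1],
    "taxes":    [-1, -1, 1, 1],
}
mixed = {
    "guns":     [-1, -1, 1, 1],
    "abortion": [-1, 1, -1, 1],
    "taxes":    [-1, 1, 1, -1],
}

print(mean_abs_correlation(aligned))  # every issue moves together
print(mean_abs_correlation(mixed))    # issues are pairwise uncorrelated
```

In the first toy population every stance predicts every other, so the score is at its maximum; in the second, knowing someone's view on one issue tells you nothing about the others.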

Iyengar has an excellent survey on affective polarization, arguing that it is the more dangerous of the two for a healthy democracy. Unfortunately, it has also increased significantly over time in the U.S., as we will see below. Most of the metrics below measure affective polarization, although some measure a combination of the two, and I won’t be too careful to distinguish (we’ll distinguish them more carefully in later posts).

Sources

To reach these conclusions, I primarily used the following sources:

Social Media, News Consumption, and Polarization: Evidence from a Field Experiment (Levy, 2020)
Cross-Country Trends in Affective Polarization (Boxell et al., 2020)
Greater Internet use is not associated with faster growth in political polarization among US demographic groups (Boxell et al., 2017)

I also consulted these other sources:

The Origins and Consequences of Affective Polarization in the United States (Iyengar et al., 2019), which among other things documents the case for affective polarization having important consequences (e.g. affecting job offers and dating).
The Effect of Social Media on Elections: Evidence from the United States (Fujiwara et al., 2020), but I didn’t lean on this because its causal analysis relied on an instrument that I didn’t have enough intuition for.
The Welfare Effects of Social Media (Allcott et al., 2019), but I didn’t lean on this because I was worried about sample size.

Social media seems unlikely to be the major direct cause of polarization, although it is possible that it has accelerated trends in polarization that already exist. The data do not say much either way regarding acceleration, due to finite-sample noise.

Not Likely the Major Cause

The first piece of evidence against social media being the major direct cause of polarization is the cross-country trends in affective polarization from Boxell et al. (2020), shown below. Primarily the US, Canada, and UK show increasing polarization since 2000, while other countries show flat or decreasing trends. This is despite fairly similar trends in internet and broadband penetration across the 9 Western countries studied.

However, two European readers of this draft noted that for many-party systems, it is unclear that we can interpret polarization in the same way as for a two-party system. Thus perhaps we can only really meaningfully compare the US, UK, and Canada in the graph above (even then, the UK is a surprisingly non-polarized outlier, though perhaps Brexit and other recent developments show that this was itself temporary).

Acknowledging this caveat, US polarization itself seems well-explained by other trends. For instance, Boxell et al. (2017) find that polarization, averaged across 8 indicators, has increased steadily since 1980:

This trend in polarization seems better explained by the post-Civil Rights, post-suburbanization party realignment around 1970, in which the South flipped to being Republican and the Republican party itself focused more on identity and values politics. It’s not clear that we need social media as an additional explanatory variable for these trends, although if we squint at 2008-2016 we can see a (possibly spurious) uptick in the slope that could be compatible with social media accelerating the trend in polarization.

Another moderate piece of evidence against social media being the primary cause is that polarization has increased the most (from 1996 to 2016) in older age groups that use the internet less. Below are partisan affect and the overall index measured for each of several age groups (Diff is the difference between 65+ and 18-39):

Measure	Overall	18–39	40–64	65+	Diff
Partisan affect	9.1 (3.0)	4.3 (4.9)	8.9 (4.3)	13.5 (7.7)	9.27 (9.28)
Index	0.28 (0.04)	0.23 (0.06)	0.23 (0.06)	0.47 (0.08)	0.23 (0.10)
Social media use		90%	65%	33%
Population share		38.2%	40.5%	21.3%

The Partisan Affect and Index are from Table 1 of the Boxell paper, Social media use is from eyeballing Figure 2 of that paper, and the Population share came from eyeballing this graph.

Back-of-the-envelope calculations. If we assume all increase in polarization among 18-39 is from social media use, and weight the other age categories in accordance with this, then it explains (0.382 + 0.405 * 0.65/0.9 + 0.213 * 0.33/0.9) * 0.23 = 0.17 out of the 0.28 points, or 61%, which would be consistent with it being the major driver. On the other hand, this is too high because we should only attribute post-2008 increases to social media. Unfortunately, breaking out across age groups and only looking at an 8-year span yields data that are too noisy to easily interpret. I include this data below for completeness (ignore the grayed out lines):

One final caveat is that Pew finds less sharp differences in social media use across age groups.
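The back-of-the-envelope calculation above can be reproduced in a few lines. This is a minimal sketch that uses only the numbers from the table (two of which are eyeballed from figures); the variable names are illustrative, and the proportional-to-usage weighting is the same simplifying assumption made in the text.

```python
# Numbers from the table above; social media use and population share are eyeballed.
population_share = {"18-39": 0.382, "40-64": 0.405, "65+": 0.213}
social_media_use = {"18-39": 0.90, "40-64": 0.65, "65+": 0.33}

index_increase_young = 0.23    # index increase for ages 18-39
index_increase_overall = 0.28  # index increase across all ages

# Assumption: the entire 18-39 increase is due to social media, and other age
# groups are affected in proportion to their usage relative to 18-39.
baseline_use = social_media_use["18-39"]
weight = sum(
    population_share[group] * social_media_use[group] / baseline_use
    for group in population_share
)

explained = weight * index_increase_young
fraction_explained = explained / index_increase_overall

print(f"explained: {explained:.2f} of {index_increase_overall:.2f} index points")
print(f"fraction explained: {fraction_explained:.0%}")
```

Without intermediate rounding the fraction comes out closer to 62% than 61%; the figure in the text corresponds to rounding the explained amount to 0.17 first. Either way the conclusion is unchanged.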

Overall, I find the age-related data less convincing than the country-related data. They provide some evidence that social media is not the primary driver, but seem insufficient to obviously bound its effect or to test the acceleration hypothesis. Since the country-level data also have interpretation issues, I don’t think this completely rules out social media as a cause. However, I do think the strong historical increases pre-internet, as well as the general predominance of cable news, show that we would have a lot of polarization even without the internet or social media.

Are We Abnormally Online?

Most readers of this blog post consume significantly more news via the internet than does the average American. For instance, Figure 4 in this paper shows that every age group (starting at 18-24 year olds) consumes more news via television than online. This is true for approximately zero of the people I know personally (except possibly my parents). As a result, I’ve relied less on my personal intuitions about social media polarization and more on statistical data.

Of course, it could be that social media drives polarization only among a small elite segment of highly educated Americans, but that this is nevertheless important because those Americans include most leaders in government, tech, and other important sectors. I place some weight on this being true (but think this would imply much different policy recommendations than those currently being discussed).

If we draw a linear trend through the overall polarization time series in the U.S., we see that the increase since 2008 has been above trend:

With the most generous choice of start and end points (1972 and 2008), it looks like the polarization increase from 2008 to 2016 might even be 5x what would be predicted under the trend! However, it’s difficult to tell this apart from noise, and you can get pretty different results based on where you start the line from (although for all that I checked, 2016 is substantially above trend). This looks like weak positive evidence for the acceleration hypothesis to me.
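The comparison described above (fit a line on an earlier window, extrapolate, and compare with the observed increase) can be sketched as follows. The series here is synthetic and chosen so that the answer is exactly 2x; it is not the data behind the figure, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def increase_vs_trend(series, fit_start, fit_end, eval_end):
    """Observed increase from fit_end to eval_end, divided by the increase
    predicted by a line fit on the window [fit_start, fit_end]."""
    window = [(yr, v) for yr, v in sorted(series.items()) if fit_start <= yr <= fit_end]
    slope, _ = fit_line([yr for yr, _ in window], [v for _, v in window])
    predicted = slope * (eval_end - fit_end)
    observed = series[eval_end] - series[fit_end]
    return observed / predicted


# SYNTHETIC placeholder series (NOT the Boxell et al. data): exactly linear
# through 2008, then rising twice as fast through 2016.
synthetic = {year: 0.01 * (year - 1972) for year in range(1972, 2009, 4)}
synthetic[2012] = synthetic[2008] + 0.08
synthetic[2016] = synthetic[2008] + 0.16

ratio = increase_vs_trend(synthetic, 1972, 2008, 2016)
print(f"observed / predicted increase: {ratio:.1f}x")
```

Varying `fit_start` on a real series is a quick way to see how sensitive the ratio is to where the line starts.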

Conclusion

My overall best guess is that most polarization in the U.S. starting from 1970 has come from sources other than social media, and that TV news in particular is a strong driver. However, there is some evidence of a particularly sharp increase in polarization starting in the 2010s, which I give a ~50% chance of being driven (indirectly) by social media. If social media is a major indirect driver, I guess that it is either because it changes incentives for traditional media (~30%) or because it affects elite attitudes, which then percolate (~20%).

To better understand this, we would need to better understand the incentives created by social media. This is difficult, primarily because it is difficult to measure: you would need micro-level data on the ideological slant or affect of news articles, which is difficult to collect at scale. Some awesome political scientists and economists are currently doing this, and our lab is helping them to build NLP models to help measure slant. I’m excited to share the results of this once we have them!

Thanks to Luca Braghieri and Markus Mobius for providing feedback on this post.

Economic AI Safety

2021-07-08T00:00:00-07:00

There is a growing fear that algorithmic recommender systems, such as Facebook, YouTube, Netflix, and Amazon, are having negative effects on society, for instance by manipulating users into behaviors that they wouldn’t endorse (e.g. getting them addicted to feeds, leading them to form polarized opinions, recommending false but convincing content).

Some common responses to this fear are to advocate for privacy or to ask that users have greater agency over what they see. I argue below that neither of these will solve the problem, and that the problem may get far worse in the future. I then speculate on alternative solutions based on audits and “information marketplaces”, and discuss their limitations.

Existing discourse often mistakenly focuses on individual decisions made by individual users. For instance, it is often argued that if a company is using private data to manipulate a user, that user should have the right to ban the company’s use of the data, thereby stopping the manipulation. The problem is that the user benefits greatly from the company using their private data: Facebook’s recommendations, without use of private information, would be mostly useless. You can see this for yourself by visiting YouTube in an incognito tab. Most users will not be willing to use severely hobbled products for the sake of privacy (and note that privacy is not itself a guaranteed defense against manipulation).

A second proposal is to provide users with greater agency. If instead of passively accepting recommendations, we can control what the algorithm shows us (perhaps by changing settings, or by having recommendations themselves come with alternatives), then we can eventually bend it to satisfy our long-term preferences. The problems here are two-fold. First, even given the option, users rarely customize their experience; a paper on Netflix’s recommender system asserts:

Good businesses pay attention to what their customers have to say. But what customers ask for (as much choice as possible, comprehensive search and navigation tools, and more) and what actually works (a few compelling choices simply presented) are very different.

Thus to the extent that user agency is helpful, it would be from providing information about a small and atypical subset of power users, which must then be extrapolated to other users. In addition, even those power users are at the mercy of information extracted from other sources. This is because while, in theory, an algorithm could completely personalize itself to a user’s whims, this would take an infeasibly long time without some external prior. If you’ve shown me Firefly and Star Trek, how will you guess whether I like Star Wars without e.g. information from other users or from movie reviews? Thus, while we can provide users with choices, most of the decisions–determining the user’s choice set–have already been made implicitly at the outset.

It helps to put these issues in the context of the rest of the economy, to understand which parts of this story are specific to recommender systems. In the economy at large, businesses offer products that they sell in stores or in online marketplaces. Users visit these stores and marketplaces and choose the items that most appeal to them relative to price. Businesses put some effort into getting their products into stores, into creating appealing packaging, and into spreading the word about how great their product is. These all implicitly affect customers’ choice sets, which in any case are restricted to the finite number of items seen in a store or on the first page or so of online results. At the same time, many default choices (such as clean drinking water) are made already by governments, at least in economically developed countries.

Under this system, while branding and marketing can influence user behavior, it is difficult for businesses to trick customers into buying clearly suboptimal products. Even if such a product gains market share, eventually news of a superior product will spread through word of mouth. There are exceptions when quality is hard to judge (as in the case of medical advice) or when negative effects are subtle or delayed (such as lead poisoning). Even in these cases, if a company is sufficiently shady then customers may decide to boycott it, although boycotts often occur on the basis of sketchy information and mood affiliation, so it isn’t clear how useful this mechanism is. Finally, some products may take advantage of psychological needs or weaknesses of customers, for instance helping them cope with sadness or anxiety in an unhealthy and self-perpetuating way (e.g. binge-watching TV episodes, eating tubs of ice cream, doing drugs). While competing products (gym memberships, meditation classes) can push in the other direction, in practice the more exploitative products seem to often win out.

Returning to algorithmic recommender systems, we can see that many of the properties we were worried about are already present in the economy at large. Specifically, most decisions are already made for customers, with a small finite choice set dictating the remaining options. Businesses do try to manipulate customers and in some cases these manipulations are successful.

There are a few differences, however. First, in a typical marketplace users can choose between different versions of the same item. This is less true for recommender systems—there is only one Facebook, and collectively a small number of companies are in charge of most social media. While my choices of what to click on influence Facebook’s recommendations, I have no obvious recourse if the recommendations remain persistently misaligned with my preferences. A second, complementary issue is that Facebook’s business model is not to make me happy, but to produce value for advertisers. This exacerbates the lack of recourse from outside options. Even for a company that is trying to produce value for users, outside options are important in case the company’s assumptions on what generates value are wrong; but when the company is not even trying to produce value, outside options are crucial.

This points to one potential solution, which is to create more competition among recommender systems. If many products could generate alternative versions of the Facebook feed, allowing users to choose among them, then Facebook would have to produce a product that users wanted more than those alternatives. Even if its business model remained ad-based, it would have to compete with other services that, for instance, offered a monthly subscription fee in exchange for a higher quality feed. (I’m ignoring the many obstacles to this–since recommender systems benefit from network effects, you would probably have to enforce compatibility or data sharing across different recommender systems to create actual competition, which is an important but nontrivial problem.)

While competition would help, it wouldn’t solve the problem entirely. On the positive side, products would succeed by convincing people to use and pay money for them, and would not survive in the long-run if they eschewed obvious and accessible improvements. But the effects of recommender systems, like medical advice, are difficult to fully ascertain. They could induce the psychological equivalent of lead poisoning and it would take a long time to identify this. This is particularly worrying for recommender systems that affect our information diet, which itself strongly affects our choice sets. It will be even more worrying when algorithmic optimizers affect our daily environment, as is beginning to be the case with services like Alexa and Nest.

Our environment is the strongest determiner of our choice sets and so mis-aligned optimization of our environment may be difficult to undo. In the short run, this likely won’t be an (apparent) concern: the immediate effect of optimized environments is that most people’s environments will become substantially better. Perhaps this will also be true in the long run: environments will be better, and optimizers won’t learn to adversarially manipulate them. However, given the ease of using environments to manipulate decisions, I don’t see what existing mechanisms would prevent such manipulation from happening.

Here’s one attempt at designing a mechanism. To recap, the problem is that people have a difficult time understanding how algorithmic optimizers affect their decisions (and so can’t provide a negative reward signal in response to being manipulated). But people certainly want to understand this, so there should be a market demand for “auditors” that examine these systems and report undesirable effects to users. So perhaps we should seek to create this market?

However, I’m not sure most users could understand these audits, or distinguish between trustworthy and untrustworthy auditors. At least today, most people seem confused about what exactly is wrong with recommender systems, and news articles–arguably a weak form of auditing–often contribute to that confusion. Is there any robust way of incentivizing useful audits? Has this ever worked out in other industries, such as medicine or food safety? It’s unclear to me. I think we want some sort of information market, consisting of both auditors and counterauditors (who expose issues with the auditors), and to think carefully about how to design incentives that converge to truthful outcomes.

In conclusion, we are running towards a future in which more and more of our choice sets will be subject to strong optimization forces. Perhaps robust agency within those choice sets will offer a way out, but we should keep in mind that most of the action is elsewhere. Optimizing these other parts–our environment and our information diet–could lead to great good, but could also lead to irreversible manipulation. None of the solutions currently discussed keep us safe from the latter, and more work is needed.

Film Study for Research

2021-06-28T00:00:00-07:00

Research ability, like most tasks, is a trainable skill. However, while PhD students and other researchers spend a lot of time doing research, we often don’t spend enough time training our research abilities in order to improve. For many researchers, aside from taking classes and reading papers, most of our training is implicit, through doing research and interacting with mentors (usually a single mentor–our PhD advisor or research manager). By analogy, we are like basketball players who somehow made it to the NBA, and are now hoping that simply playing basketball games will be enough to keep improving.

Drawing on this analogy, I want to talk about two habits that are ubiquitous among elite athletes and that have analogs in research that I feel are underutilized. Those who do pursue these habits as PhD students often improve quickly as researchers.

The first habit is film study. Almost every high-level athlete watches films of other players of the same sport, including historical greats, contemporary rivals, and themselves. This allows them to incorporate good ideas from other players’ games as well as to catch and eliminate flaws in their own game. Even the very best players benefit from watching film of themselves and others.

The second habit, which I call act-reflect-ask, occurs in the course of a game or scrimmage. I’ll describe this from my own experience (although I’m by no means an elite athlete, I’ve learned this from people who are). After a point ends, I generally think about what happened during the point–Was there anything I wanted to do better? Did anything unexpected happen? Then I’ll re-run those parts in my head, simulating what I would have done differently until I feel like I know how to consistently make the right decision. In some cases, I can’t figure it out–perhaps I was playing defense, someone beat me, and I can’t figure out what they did or can’t figure out the counter. In that case I’ll ask a teammate about it (or the person who beat me, if it’s a friendly scrimmage) and talk it over until I see the right strategy for the future.

Both of these strategies are invaluable for improving. They leverage the fact that as humans, we tend to learn socially: we are very good at adopting strategies from others, so film study and asking are efficient ways to learn. Both strategies also lead to deliberate practice focused on real-world contexts. Below, I’ll show that these strategies have analogs in research, and argue that good researchers should adopt both into their own habits.

Film Study

As mentioned above, good athletes watch lots of film of other athletes. This extends to other skills as well–most chess players, including grandmasters, study games by both contemporary and historical greats. They do this to understand how other very strong players play, in order to adopt ideas and, in the case of rivals, to counter those ideas (this part is less relevant to research). Even the very best players do this.

What is the equivalent to this in research? Ideally, we would watch world experts as they work, observing how they think, perform experiments, and so on. Unfortunately, this is difficult–much research work is internal rather than external, and we don’t routinely film great researchers in the same way as we do with athletes. The closest obvious analog is working closely with a mentor, as many PhD students do with their PhD advisor. Then, it is often possible to see first-hand how a more experienced researcher approaches a problem. However, this isn’t scalable, and most people only get to do this with one person–their advisor. (As an aside, it is very useful for students to develop a good model of their advisor’s thinking style–I think this tends to be underrated.)

A more scalable approach would be reading papers, but this doesn’t achieve the full goal of film study–you only see the finished product, rather than the thought process, and it tends to only show the parts of a writer’s thoughts that are widely defensible. What we want is a public record of someone’s thoughts, including off-the-cuff thoughts that wouldn’t make it into a paper.

In fact, we do have this, in the form of blogs. The right type of blogs can provide a valuable form of “film study”. I personally learned a lot about statistics from Andrew Gelman’s blog. Often, someone sends him a paper and he just gives his off-the-cuff reactions to it: what he liked and didn’t, what was convincing, what parts seem sketchy. I probably learned more from reading his blog than from statistics classes (of which I’ve taken embarrassingly few, yet somehow managed to get hired by a Statistics department; I’ll credit Gelman for this). Scott Aaronson’s blog is good in the same way for theoretical computer science. Many posts on the GiveWell and Open Philanthropy blogs are good in this way, too. In all cases, I’d look at the earlier rather than later posts (though not the very earliest); the reason is that once blogs have too large an audience, writers start to feel constrained to write more “professionally” and you get less of the valuable off-the-cuff thinking.

In addition to blogs, debates are another good source of off-the-cuff, in-the-moment thinking, as long as the participants don’t overprepare and as long as they are trying to make good arguments rather than score rhetorical points. Actually, the best debates I’ve seen also take place via blogs, such as the debate over de-worming in global health. Seminars can be good film study, but are primarily film study for giving presentations rather than doing research (and for this, also watch recordings of great talks online). Seminar Q&A can be good film study for research thinking, as long as participants are opinionated and express those opinions in a clear way that exposes their underlying mental model. For programming, you can watch people code on Twitch, or pair program with other students in your research group.

The above are all useful sources of in-the-moment thinking. For research, we also make decisions–such as what directions to pursue–that have consequences on the scale of years. To film study these, I read histories of important scientific developments. Good histories will follow individuals around in detail for an extended period of time, ideally with primary sources. For instance, The Making of the Atomic Bomb covers developments in physics up to and through the Manhattan project, and discusses many of the decisions, discoveries, and dead ends faced by Fermi, Szilard, Oppenheimer, and others. (The dead ends are especially important, so that you can see the whole process and not just what is useful today.) Another great example is The Eighth Day of Creation, which does the same for the development of modern biotechnology. Such histories have helped me gain a better understanding of how science develops on the scale of years or decades, which I would otherwise have to learn the hard way, over my own years and decades of research.

Some other miscellaneous advice: transcripts of talks can sometimes be good in the same way as blogs. Richard Hamming’s “You and Your Research” is excellent on this front. For talks, recording yourself and watching the recording may be the fastest route to improvement. Finally, in addition to histories, case studies (often taught in law or business courses) also provide information that would be expensive to gather otherwise.

In summary, film study blogs for off-the-cuff research thinking; watch great presentations and record yourself to learn how to speak; pair program and watch programming streams; and read histories of science for long-term research decisions.

Act-Reflect-Ask

In the act-reflect-ask loop, we reflect on whether something could have gone better after we do it, and ask someone else if we can’t figure it out. There are many ways to do this in research:

When seeing a proof, if you don’t see how you would have come up with the proof yourself, discuss with others how to do so (this is usually what people mean when they ask “what’s the motivation for that step?”). The same goes whenever you see a cool experiment or idea that you’re not sure you would have come up with yourself. First try to think about whether there’s a way to modify your thought process to reliably come up with such ideas in the future. If not, discuss with the presenter so that you can learn.
After you give a talk, pull aside one of the audience members and get feedback on what worked/didn’t work in the talk.
After attending a seminar, discuss what was or wasn’t convincing, what was most interesting, etc. Paper reading groups are valuable as they often focus on this. (This isn’t quite act-reflect-ask since the seminar was given by someone else; but you can think of it as a way of checking your own thoughts during the seminar against others’.)
Every week, reflect on what things felt less efficient than they needed to be. Think for yourself how to improve these, then talk to friends, colleagues, or mentors to get additional ideas.

In addition to helping yourself improve, these habits help others as well–asking someone for advice engages their own thinking in a growth-oriented direction, so by helping you they are likely improving themselves, too. This also helps at the level of teams, as it builds chemistry and creates a shared culture of excellence and growth. Indeed, in sports, the best teams do this regularly, and veteran players are proactive in finding ways to help younger players. Some professional players even stay in a league, making millions of dollars a year, solely by being excellent sources of advice and mentorship.

Summary

Find ways to routinely study research decision-making, through blogs, seminars, video streams, and histories. Actively consume these to adopt and build up effective mental heuristics. Whenever you do something, reflect on how it could be better, and ask others for advice. As you learn more yourself, find ways to give back to others. Consistently doing these will help you to become a better researcher over time, and contribute to a culture of excellence among those around you.

Donations for 2019/2020

2021-06-23T00:00:00-07:00

Each year I aim to donate around 10% of my income. In 2019, I fell behind on this, probably due to the chaos of COVID-19 (but really this was just an embarrassing logistical failure on my part). I’ve recently, finally, finished processing donations for 2019 and 2020. In this post I write about my decisions, in case they are useful to others; see also here for a past write-up from 2016.

The impact of COVID-19 on poor countries made me better appreciate how much better I have it than most of the world, so I tried to donate closer to 20% of my 2020 income, and that will be my goal moving forward as well. Between 2019 and 2020, this came out to \$45,000 in total. My aim is to have the greatest positive impact possible with my funding, averaging over a few different moral frameworks (i.e. both ones that place significant weight on future generations, and ones that prioritize those currently in need).

This year, I allocated donations across the following cause areas: 45% to helping the global poor, 51% to protecting the long-term future of humanity, and 4% to miscellaneous causes, primarily in US policy. In addition, I had previously allocated \$3300 to US political causes during the year. I give more detail on each below.

Global Health and Development

14% to the GiveWell Maximum Impact Fund
12% to the Global Health and Development Fund
17% to the Center for Global Development
2% to GiveWell’s operating expenses

Reasoning: The Maximum Impact Fund and the Global Health and Development Fund (GHDF) are both related to GiveWell; the former is run by GiveWell itself, while the latter is independent but managed by GiveWell’s CEO, Elie Hassenfeld. The difference is that GiveWell, for brand reasons, is committed to funding organizations with a strong evidence base, while the GHDF can choose to fund higher-risk, higher-reward opportunities within global health and development, although it also funds GiveWell top charities with some fraction of its money.

I had decided that I wanted around 20% of my donations to go toward helping the global poor in a relatively straightforward way (i.e. not through research about what to do in the future, but direct interventions that will help today). Based on my estimate of their portfolios, the 14% + 12% mix between these two funds got me to the 20% target while also allocating some money towards research. The Center for Global Development is primarily focused on research, and has a strong track record of past effectiveness. Finally, GiveWell recommends allocating 10% of any donation made to them toward their operating expenses; 10% of the 14% above is 1.4%, which I rounded up to 2%.

Reflections: Overall, I estimate that this allocation gave 20% to straightforward interventions, 23% to research, and 2% to GiveWell’s operating expenses.

In retrospect, I think a better allocation would have been 40% to GHDF and 5% to GiveWell. The reason is that GHDF is actively managed by someone who I trust, who has goals similar to mine, and who is significantly more informed than I am, so I would expect whatever allocation Elie chooses to be better than what I chose above. In addition, I have grown more comfortable with higher-risk donations; I was already fairly comfortable with them, allocating ~80% to high-risk/high-reward opportunities, but I’d now feel okay with up to ~90%.

One additional reason to favor GHDF is the following quote on their webpage:

Thus, until GiveWell finds opportunities that surpass Open Philanthropy’s available funding, donations to this fund are most likely to either:

(a) displace Open Philanthropy funding of Incubation Grant-like opportunities (and cause that funding to instead go to GiveWell’s top charities), or

(b) go directly to GiveWell’s top charities.

Nonetheless, donating to this fund is valuable because it helps demonstrate to GiveWell that there is donor demand for higher-risk, higher-reward global health and development giving opportunities.

Finally, I feel that giving only 2% to GiveWell created perverse incentives: if GHDF hadn’t existed, I would have donated more to GiveWell and thus given more to cover their operating costs. Since GHDF is run by the CEO of GiveWell, it seems incorrect to penalize GiveWell for GHDF’s existence, so moving forward I will allocate 10% of my {GiveWell + GHDF} donation to cover operating expenses.

Long-Term Future

25% to the Long-Term Future Fund
25% to support work on better AI policy
1% to the Legal Priorities Project

Since I work in AI, and do some work intersecting policy, I chose to keep the particular organization(s) in the second group private, to avoid a conflict of interest where my status as a donor might increase my social capital.

Reasoning: The Long-Term Future Fund funds technical or conceptual research oriented towards safeguarding the long-term future of humanity. They are actively managed and mostly give small grants to individual researchers or small organizations, an approach which I think has the potential for high impact. While some of the areas they focus on, such as safe AI, are not primarily cash-constrained, I think LTFF does a good job of identifying instances where cash can actually help. In some cases, they made grants that I was initially skeptical of but that in retrospect seemed like good ideas. I therefore trust their judgment to align reasonably well with what I would conclude after significant investigation.

On the other hand, they mostly do not fund policy-related work, and I think that good AI policy, especially surrounding international conflict and arms races, could be very important for humanity’s long-term future. I therefore split my donations in this area in half between these two directions.

I also donated a small amount to the Legal Priorities Project. They are a relatively new organization that in part seeks to improve law to take future generations into account; they also tackle several other questions in legal research that they view as high-impact. They were one of several small organizations that I investigated this year, and seemed the most impressive to me. Although they do not yet have a clear track record of success, I don’t view this as unusual for an organization of their age, and I think supporting good organizations early on, before they are clearly successful, is often the highest-impact approach (assuming you can pick the organizations well, which can be difficult).

Reflections: This year I put a non-trivial amount of effort into evaluating LPP as well as several other small organizations. In retrospect, I think LTFF probably put significantly more effort than I did into evaluating all of the organizations I looked at, as well as several others, when deciding on their grant allocations. Therefore, in the future I would probably just allocate to LTFF and trust their decision-making.

I feel somewhat worried about this, because if everyone pursues this strategy then it would concentrate grant-making in a small number of organizations, which could distort the overall funding ecosystem. That being said, I think the ultimate solution is to have other competitors to LTFF, rather than making low-information decisions as an individual. My hope is that funding them generously this year will help incentivize the creation of other strong grantmaking organizations.

Miscellaneous

4% to the International Refugee Assistance Project
\$3300 to other U.S. political causes

Reasoning: I am less confident that these donations maximize impact compared to the ones above, although I do feel that IRAP is a very good organization. The main reason these wouldn’t maximize impact is that they are U.S.-centric, while most of the strongest philanthropic opportunities lie abroad.

I felt that IRAP was plausibly in the same ballpark as global health interventions in terms of impact, since they focus on immigration reform, whose beneficiaries are primarily in other countries. This is a neglected policy area within the U.S., and policy can be a strong philanthropic lever in areas that are not entrenched along partisan lines. A secondary benefit is that better immigration policy could help recruit more talented researchers to the U.S., which could help in other areas such as AI.

The other political causes were focused only on the U.S., in areas that I felt a strong personal obligation to address (since they are both political and personal in nature, I did not list the organizations themselves). I didn’t count them towards the “10%/20% of income” target, since they were partly based on personal emotional appeal rather than an impersonal attempt to maximize impact, but I am still glad that I made them.

Thoughts on Overall Allocation

As noted at the top, I allocated 51% of my donations to the long-term future, 45% to global health and development, and 4% to other causes. In subsequent years, I would like to allocate more to the long-term future bucket, as I feel it is one of the most important and neglected cause areas. However, I have struggled to find outstanding opportunities in this area, which is why I currently am closer to a 50-50 split with global health. If there were better opportunities, I would probably shift to a ~70-30 split that was more future-focused.

Other Notes

For most of the larger donations, I donated appreciated stock rather than cash. This was the first year I did this, and it was complicated to figure out initially, but it will be easy to repeat in future years now that I have it figured out. The reason to do this is that it can yield substantial tax benefits, and I would recommend it to most people who have an investment account and make significant donations each year.

Measurement, Optimization, and Take-off Speed

2021-04-07T00:00:00-07:00

In machine learning, we are obsessed with datasets and metrics: progress in areas as diverse as natural language understanding, object recognition, and reinforcement learning is tracked by numerical scores on agreed-upon benchmarks. Despite this, I think we focus too little on measurement—that is, on ways of extracting data from machine learning models that bears upon important hypotheses. This might sound paradoxical, since benchmarks are after all one way of measuring a model. However, benchmarks are a very narrow form of measurement, and I will argue below that trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning. I’ll argue this in three ways:

Historically, more measurement has almost always been a great move, not only in science but also in engineering and policymaking.
Philosophically, measurement has many good properties that bear upon important questions in ML.
In my own research, just measuring something and seeing what happened has often been surprisingly fruitful.

Once I’ve sold you on measurement in general, I’ll apply it to a particular case: the level of optimization power that a machine learning model has. Optimization power is an intuitively useful concept, often used when discussing risks from AI systems, but as far as I know no one has tried to measure it or even say what it would mean to measure it in principle. We’ll explore how to do this and see that there are at least two distinct measurable aspects of optimization power that have different implications (one on how misaligned an agent’s actions might be, and the other on how quickly an agent’s capabilities might grow). This shows by example that even thinking about how to measure something can be a helpful clarifying exercise, and suggests future directions of research that I discuss at the end.

Measurement Is Great

Above I defined measurement as extracting data from [a system], such that the data bears upon an important hypothesis. Examples of this would be looking at things under a microscope, performing crystallography, collecting survey data, observing health outcomes in clinical trials, or measuring household income one year after running a philanthropic program. In machine learning, it would include measuring the accuracy of models on a dataset, measuring the variability of models across random seeds, visualizing neural network representations, and computing the influence of training data points on model predictions.

I think we should measure far more things than we do. In fact, when thinking about almost any empirical research question, one of my early thoughts is: “Can I in principle measure something which would tell me the answer to this question, or at least get me started?” For an engineering goal, often it’s enough to measure the extent to which the goal has not yet been met (we can then perform gradient descent on that measure). For a scientific or causal question, thinking about an in-principle (but potentially infeasible) measurement that would answer the question can help clarify what we actually want and guide the design of a viable experiment. I think asking yourself this question pretty much all the time would probably make you a stronger researcher.

Below I’ll argue in more detail why measurement is so great. I’ll start by showing that it’s historically been a great idea, then offer philosophical arguments in its favor, and finally give instances where it’s been helpful in my own research.

Historical Support

Throughout the history of science, measuring things (often even in undirected ways) has repeatedly proven useful. Looking under a microscope revealed red blood cells, spermatozoa, and micro-organisms. X-ray crystallography helped drive many of the developments in molecular biology, underscored by the following quote from Francis Crick (emphasis mine):

But then, as you know, at the atomic level x-ray crystallography has turned out to be extremely powerful in determining the three-dimensional structure of macromolecules. Especially now, when combined with methods for measuring the intensities automatically and analyzing the data with very fast computers. The list of techniques is not something static—and they’re getting faster all the time. We have a saying in the lab that the difficulty of a project goes from the Nobel-prize level to the master’s-thesis level in ten years!

Indeed, much of the rapid progress in molecular biology was driven by a symbiotic relationship between new discoveries and better measurement, both in crystallography and in the types of assays that could be run as we discovered new ways of manipulating biological structures.

Measurement has also been useful outside of science. In development economics, measuring the impact of foreign aid interventions has upended the entire aid ecosystem, revealing previously popular interventions to be nearly useless while promoting new ones such as anti-malarial bednets. CO2 measurement helped alert us to the dangers of climate change. GPS measurement, initially developed for military applications, is now used for navigation, time synchronization, mining, disaster relief, and atmospheric measurements.

A key property is that many important trends are measurable long before they are viscerally apparent. Indeed, sometimes what feels like a discrete shift is a culmination of sustained exponential growth. One recent example is COVID-19 cases. While for those in major U.S. cities, it feels like “everything happened” in a single week in March, this was actually the culmination of exponential spread that started in December and was relatively clearly measurable at least by late January, even with limited testing. Another example is that market externalities can often be measured long before they become a problem. For instance, when gas power was first introduced, it actually decreased pollution because it was cleaner than coal, which is what it mainly displaced. However, it was also cheaper, and so the supply of gas-powered energy expanded dramatically, leading to greater pollution in the long run. This eventual consequence would have been predictable by measuring the rapid proliferation of gas-powered devices, even during the period where pollution itself had decreased. I find this a compelling analogy when thinking about how to predict unintended consequences of AI.

Philosophical Support

Measurement has several valuable properties. The first is that considering how to measure something forces one to ground a discussion: seemingly meaningful concepts may be revealed as incoherent if there is no way to measure them even in principle, while intractable disagreements might either vanish or quickly resolve when turned into conflicting predictions about an empirically measurable outcome.

Within science, being able to measure more properties of a system creates more interlocking constraints that can error-check theories. These interlocking constraints are, I believe, an underappreciated prerequisite for building meaningful scientific theories in the first place. That is, a scientific theory is not just a way of making predictions about an outcome, but presents a self-consistent account of multiple interrelated phenomena. This is why, for instance, exceeding the speed of light or failing to conserve energy would require radically rethinking our theories of physics (this would not be so if science was only about prediction and not the interlocking constraints). Walter Gilbert, Nobel laureate in Chemistry, puts this well (emphasis again mine):

“The major problem really was that when you’re doing experiments in a domain that you do not understand at all, you have no guidance what the experiment should even look like. Experiments come in a number of categories. There are experiments which you can describe entirely, formulate completely so that an answer must emerge; the experiment will show you A or B; both forms of the result will be meaningful; and you understand the world well enough so that there are only those two outcomes. Now, there’s a large other class of experiments that do not have that property. In which you do not understand the world well enough to be able to restrict the answers to the experiment. So you do the experiment, and you stare at it and say, Now does it mean anything, or can it suggest something which I might be able to amplify in further experiment? What is the world really going to look like?”

“The messenger experiments had that property. We did not know what it should look like. And therefore as you do the experiments, you do not know what in the result is your artifact, and what is the phenomenon. There can be a long period in which there is experimentation that circles the topic. But finally you learn how to do experiments in a certain way: you discover ways of doing them that are reproducible, or at least—” He hesitated. “That’s actually a bad way of saying it—bad criterion, if it’s just reproducibility, because you can reproduce artifacts very very well. There’s a larger domain of experiments where the phenomena have to be reproducible and have to be interconnected. Over a large range of variation of parameters, so you believe you understand something.”

This quote was in relation to the discovery of mRNA. The first paragraph of this quote, incidentally, is why I think phrases such as “novel but not surprising” should never appear in a review of a scientific paper as a reason for rejection. When we don’t even have a well-conceptualized outcome space, almost all outcomes will be “unsurprising”, because we don’t have any theories to constrain our expectations. But this is exactly when it is most important to start measuring things!

Finally, measurement often swiftly resolves long-standing debates and misconceptions. We can already see this in machine learning. For distribution shift, many feared that highly accurate neural net models were overfitting their training distribution and would have low out-of-distribution accuracy; but once we measured it, we found that in- and out-of-distribution accuracy were actually strongly correlated. Many also feared that neural network representations were mostly random gibberish that happened to get the answer correct. While adversarial examples show that these representations do have issues, visualizing the representations shows clear semantic structure such as edge filters, at least ruling out the “gibberish” hypothesis.

In both of the preceding cases, measurement also led to a far more nuanced subsequent debate. For robustness to distribution shift, we now focus on what interventions beyond accuracy can improve robustness, and whether these interventions reliably generalize across different types of shift. For neural network representations, we now ask how different layers behave, and what the best way is to extract information from each layer.

Personal Anecdotes

This brings us to the success of measurement in some of my own work. I’ve already hinted at it in talking about machine learning robustness. Early robustness benchmarks (not mine) such as ImageNet-C and ImageNet-v2 revealed the strong correlation between in-distribution and out-of-distribution accuracy. They also raised several hypotheses about what might improve robustness (beyond just in-distribution accuracy), such as larger models, more diverse training data, certain types of data augmentation, or certain architecture choices. However, these datasets measured robustness only to two specific families of shifts. Our group then decided to measure robustness to pretty much everything we could think of, including abstract renditions (ImageNet-R), occlusion and viewpoint variation (DeepFashion Remixed), and country and year (StreetView Storefronts). We found that almost every existing hypothesis had to be at least qualified, and we currently have a more nuanced set of hypotheses centered around “texture bias”. Moreover, based on a survey of ourselves and several colleagues, no one predicted the full set of qualitative results ahead of time.

In this particular case I think it was important that many of the distribution shifts were relative to the same training distribution (ImageNet), and the rest were at least in the vision domain. This allowed us to measure multiple phenomena for the same basic object, although I think we’re not yet at Walter Gilbert’s “larger domain of interconnected phenomena”, so more remains to be done.

Another example concerned the folk theory that ML systems might be imbalanced savants: very good at one type of thing but poor at many other things. This was again a sort of conventional wisdom that floated around for a long time, supported by anecdotes, but never systematically measured. We did so by testing the few-shot performance of language models across 57 domains including elementary mathematics, US history, computer science, law, and more. What we found was that while ML models are indeed imbalanced (far better, for instance, at high school geography than high school physics), they are probably less imbalanced than humans on the same tasks, so “savant” is the wrong designation. (This is only one domain and doesn’t close the case; for instance, AlphaZero is more savant-like.) We incidentally found that models were also poorly calibrated, which may be an equally important issue to address.

Finally, to better understand the generalization of neural networks, we measured the bias-variance decomposition of the test error. This decomposes error into the variance (the error caused by random variation due to random initialization, choice of training data, etc.) and bias (the part of the error that is systematic across these random choices). This is in some sense a fairly rudimentary move as far as measurement goes (it replaces a single measurement with two measurements) but has been surprisingly fruitful. First, it helped explain strange “double descent” generalization curves that exhibit two peaks (they are the separate peaks in bias and variance). But it has also helped us to clarify other hypotheses. For instance, adversarially trained models tend to generalize poorly, and this is often explained via a two-dimensional conceptual example where the adversarial decision boundary is more crooked (and thus higher complexity) than the regular decision boundary. While this conceptual example correctly predicts the bulk generalization behavior, it doesn’t always correctly predict the bias and variance individually, suggesting there is more to be said than the current story.
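
To make the decomposition concrete, here is a minimal sketch under squared error, which is an assumption on my part (other losses need a different decomposition). The toy model, a shrunken least-squares slope, and all names below are hypothetical stand-ins for a neural network retrained with different seeds and training sets. The targets are noiseless, so the total error splits exactly into squared bias plus variance.

```python
import random

def fit_slope(xs, ys, shrinkage):
    """Least-squares slope through the origin, shrunk toward zero."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + shrinkage)

def bias_variance(predictions, targets):
    """predictions[s][i] is run s's prediction on test point i.

    Returns (bias_sq, variance, total_error), each averaged over test points.
    """
    n_runs, n_points = len(predictions), len(targets)
    bias_sq = variance = total = 0.0
    for i, y in enumerate(targets):
        preds = [run[i] for run in predictions]
        mean_pred = sum(preds) / n_runs
        bias_sq += (mean_pred - y) ** 2  # systematic across runs
        variance += sum((p - mean_pred) ** 2 for p in preds) / n_runs  # run-to-run spread
        total += sum((p - y) ** 2 for p in preds) / n_runs
    return bias_sq / n_points, variance / n_points, total / n_points

rng = random.Random(0)
true_slope = 2.0
test_xs = [0.5, 1.0, 1.5, 2.0]
targets = [true_slope * x for x in test_xs]  # noiseless targets, for simplicity

predictions = []
for _ in range(200):  # each run draws a fresh noisy training set
    train_xs = [rng.uniform(0.0, 2.0) for _ in range(10)]
    train_ys = [true_slope * x + rng.gauss(0.0, 1.0) for x in train_xs]
    slope = fit_slope(train_xs, train_ys, shrinkage=5.0)
    predictions.append([slope * x for x in test_xs])

bias_sq, variance, total = bias_variance(predictions, targets)
print(f"bias^2={bias_sq:.3f} variance={variance:.3f} total={total:.3f}")
```

Here the shrinkage creates the systematic error (bias), and resampling the training set creates the variance; tracking the two numbers separately as model size changes is the kind of measurement described above.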

Optimization Power

Optimization power is, roughly, the degree to which an agent can shape its actions or environment towards accomplishing some objective. This concept often appears when discussing risks from AI—for instance, a paperclip-maximizing AI with too much optimization power might convert the whole world to paperclips; or recommendation systems with too much optimization power might create clickbait or addict users to the feed; or a system trained to maximize reported human happiness might one day take control of the reporting mechanism, rather than actually making humans happy (or perhaps make us happy at the expense of other values that promote flourishing).

In all of these cases, the concern is that future AI systems will (and perhaps already do) have a lot of optimization power, and this could lead to bad outcomes when optimizing even a slightly misspecified objective. But what does “optimization power” actually mean? Intuitively, it is the ability of a system to pursue an objective, but beyond that the concept is vague. (For instance, does GPT-3 or the Facebook newsfeed have more optimization power? How would we tell?)

To remedy this, I will propose two ways of measuring optimization power. The first is about “outer optimization” power (that is, of SGD or whatever other process is optimizing the learned parameters of the agent), while the second is about “inner optimization” power (that is, the optimization performed by the agent itself). Think of outer optimization as analogous to evolution and inner optimization as analogous to the learning a human does during their lifetime (both will be defined later). Measuring outer optimization will primarily tell us about the reward hacking concerns discussed above, while inner optimization will provide information about “take-off speed” (how quickly AI capabilities might improve from subhuman to superhuman).

Outer Optimization

By outer optimization, we mean the process by which stochastic gradient descent (SGD) or some other algorithm shapes the learned parameters of an agent or system. The concern is that SGD, given enough computation, explores a vast space of possibilities and might find unintended corners of that possibility space.

To measure this, I propose the following. In many cases, a system’s objective function has some set of hyperparameters or other design choices (for instance, Facebook might have some numerical trade-off between likes, shares, and other metrics). Suppose we perturb these parameters slightly (e.g. change the stated trade-off between likes and shares), and then re-optimize for that new objective. We then measure: How much worse is the system according to the original objective?

I think this directly gets at the issue of hacking mis-specified rewards, because it measures how much worse you would do if you got the objective function slightly wrong. And, it is a concrete number that we can measure, at least in principle! If we collected this number for many important AI systems, and tracked how it changes over time, then we could see if reward hacking is getting worse over time and course correct if need be.
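
As a toy illustration of the perturb-and-re-optimize measurement, consider the sketch below. Everything in it is hypothetical: the "system" is a single number `a` in [0, 1], the objective is a made-up trade-off between likes and shares with weight `w`, and optimization is a grid search standing in for SGD.

```python
import math

def make_objective(w):
    """Hypothetical engagement objective: w trades off 'likes' against 'shares'."""
    def objective(a):
        likes = math.sqrt(a)         # made-up response curve
        shares = math.sqrt(1.0 - a)  # made-up response curve
        return w * likes + (1.0 - w) * shares
    return objective

def reoptimization_regret(w, perturbed_w, candidates):
    """How much worse the re-optimized system is under the ORIGINAL objective."""
    original = make_objective(w)
    best = max(candidates, key=original)
    best_perturbed = max(candidates, key=make_objective(perturbed_w))
    return original(best) - original(best_perturbed)

candidates = [i / 1000 for i in range(1001)]  # grid search stands in for SGD
small = reoptimization_regret(0.5, 0.55, candidates)
large = reoptimization_regret(0.5, 0.70, candidates)
print(f"regret after small perturbation: {small:.4f}")
print(f"regret after large perturbation: {large:.4f}")
```

The returned regret is the "concrete number" in question: it is zero when the objective is unperturbed and grows as the perturbed objective pulls the optimum further from the original one.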

Challenges and Research Questions. There are some issues with this metric. First, it relies on the fairly arbitrary set of parameters that happen to appear in the objective function. These parameters may or may not capture the types of mis-specification that are likely to occur for the true reward function. We could potentially address this by including additional perturbations, but this is an additional design choice and it would be better to have some guidance on what perturbations to use, or a more intrinsic way of perturbing the objective.

Another issue is that in many important cases, re-optimizing the objective isn’t actually feasible. For instance, Facebook couldn’t just create an entire copy of Facebook to optimize. Even for agents that don’t interact with users, such as OpenAI’s DotA agent, a single run of training might be so expensive that doing it even twice is infeasible, let alone continuously re-optimizing.

Thus, while the re-optimization measurement in principle provides a way to measure reward hacking, we need some technical ideas to tractably simulate it. I think ideas from causal inference could be relevant: we could treat the perturbed objective as a causal intervention, and use something like propensity weighting to simulate its effects. Other ideas such as influence functions might also help. However, we would need to scale these techniques to the complex, non-parametric, non-convex models that are used in modern ML applications.
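
To illustrate the propensity-weighting idea in its simplest form, here is a sketch of an inverse-propensity-weighted estimate. It assumes a setting far simpler than the ones discussed above: two actions, no context, and known logging probabilities. The policies and reward rates are hypothetical. The point is only that logs collected under the current system can estimate how a different policy (say, one induced by a perturbed objective) would have performed, without deploying it.

```python
import random

def ips_estimate(logs, target_policy):
    """Inverse-propensity-weighted estimate of the target policy's mean reward.

    Each log entry is (action, logging_probability, reward).
    """
    weighted = [target_policy[a] / p * r for a, p, r in logs]
    return sum(weighted) / len(weighted)

rng = random.Random(0)
logging_policy = {"A": 0.5, "B": 0.5}  # the policy that generated the logs
target_policy = {"A": 0.1, "B": 0.9}   # the policy we want to evaluate
mean_reward = {"A": 0.2, "B": 0.6}     # hypothetical, unknown to the estimator

logs = []
for _ in range(50_000):
    action = "A" if rng.random() < logging_policy["A"] else "B"
    reward = 1.0 if rng.random() < mean_reward[action] else 0.0
    logs.append((action, logging_policy[action], reward))

estimate = ips_estimate(logs, target_policy)
true_value = sum(target_policy[a] * mean_reward[a] for a in target_policy)
print(f"IPS estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

This only works when the logging policy gives every relevant action non-negligible probability, which is one reason scaling the idea to large, non-convex models is an open problem rather than a solved one.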

Inner Optimization

Finally, we turn to inner optimization. Inner optimization is the optimization that an agent itself performs during its own “lifetime”. It is important because an agent’s inner optimization can potentially be much faster than the outer optimization that shapes it. For instance, while evolution is an optimization process that shapes humans and other organisms, humans themselves also optimize the environment, and on a much faster time scale than evolution. Indeed, while evolution was the dominant force for much of Earth’s history, humans have been the dominant force in recent millennia. This underscores another important point: when inner optimization outpaces outer optimization, it can lead to a phase transition in what outcomes are pursued.

With all of this said, I think it’s an error to refer to humans (or ML agents) as “optimizers”. While humans definitely pursue goals, we rarely do so optimally, and most of our behavior is based on adapted reflexes and habits rather than actually taking the argmax of some function. I think it’s better to think of humans as “adapters” rather than “optimizers”, and so by inner optimization I really will mean inner adaptation.

To further clarify this, I think it helps to see how adapting behaviors arise in humans and other animals. Evolution changes organisms across generations to be adapted to their environment, and so it’s not obvious that organisms need to further adapt during their lifetime—we could imagine a simple insect that only performs reflex actions and never learns or adapts. However, when environments themselves can change over time, it pays for an organism to not only be adapted by evolution but also adaptable within its own lifetime. Planning and optimizing can be thought of as certain limiting types of adaptation machinery—they are general-purpose and so can be applied even in novel situations, but are often less efficient than special-purpose solutions.

Returning to ML, we can measure inner adaptation by looking at how much an agent changes over its own lifetime, relative to how much it changes across SGD steps. For example, we could take a language model trained on the web, then deploy it on a corpus of books (but not update its learned parameters). What does its average log-loss on the next book look like after having “read” 0, 1, 2, 5, 10, 50, etc. books already? The greater the degree to which this decreases, the more we will say the agent performs inner adaptation.
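
Here is a toy sketch of that measurement. The agent is a hypothetical stand-in for a language model: its frozen "learned parameters" are a uniform prior over a five-token vocabulary, and its only lifetime state is a table of token counts. The "books" are random documents drawn from a skewed distribution that the prior does not match. The quantity tracked is the average log-loss on the next unread document after reading k documents.

```python
import math
import random

VOCAB = ["a", "b", "c", "d", "e"]

class CountingAgent:
    """Frozen prior (the 'learned parameters') plus mutable token counts (lifetime state)."""

    def __init__(self, prior, prior_weight=1.0):
        self.prior = prior
        self.prior_weight = prior_weight
        self.counts = {token: 0 for token in prior}
        self.total = 0

    def prob(self, token):
        return (self.prior_weight * self.prior[token] + self.counts[token]) / (
            self.prior_weight + self.total
        )

    def read(self, document):
        for token in document:
            self.counts[token] += 1
            self.total += 1

def log_loss(agent, document):
    return -sum(math.log(agent.prob(token)) for token in document) / len(document)

def adaptation_curve(agent, documents, checkpoints):
    """Log-loss on the next unread document after reading k documents, for k in checkpoints."""
    curve = {}
    for k, document in enumerate(documents):
        if k in checkpoints:
            curve[k] = log_loss(agent, document)  # evaluate before reading it
        agent.read(document)
    return curve

rng = random.Random(0)
domain = [0.6, 0.2, 0.1, 0.05, 0.05]  # hypothetical deployment distribution
documents = [rng.choices(VOCAB, weights=domain, k=500) for _ in range(51)]
agent = CountingAgent(prior={token: 1 / len(VOCAB) for token in VOCAB})
curve = adaptation_curve(agent, documents, checkpoints={0, 1, 2, 5, 10, 50})
for k in sorted(curve):
    print(f"after {k:2d} documents: log-loss {curve[k]:.3f}")
```

For a real model one would replace `CountingAgent` with the model plus whatever state it carries between inputs (context window, recurrent state, memory), and the size of the drop in the curve would be the measure of inner adaptation.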

In practice, for ML systems today we probably want to feed the agent random sentences or paragraphs to adapt to, rather than entire books. Even for these shorter sequences, we know that state-of-the-art language models perform limited inner adaptation, because transformers have a fixed context length and so can only adapt to a small bounded number of previous inputs. Interestingly, previously abandoned architectures such as recurrent networks could in principle adapt to arbitrarily long sequences.

That being said, I think we should expect more inner optimization in the near future, as AI personal assistants become viable. These agents will have longer lifetimes and will need to adapt much more precisely to the individual people that they assist. This may lead to a return to recurrent or other stateful architectures, and the long lifetime and high inter-person and cross-time heterogeneity will incentivize inner adaptation.

The above discussion illustrates the value of measurement. Even defining the measurement (without yet taking it) clarifies our thinking, and thus leads to insight about transformers vs. RNNs and the possible consequences of AI personal assistants.

Research Questions. As with outer optimization, there are still free variables to consider in defining inner adaptation. For instance, how much is “a lot” of inner adaptation? Even existing models probably adapt more in their lifetime than the equivalent of 1 step of SGD, but that’s because a single SGD step isn’t much. Should we then compare to 1000 steps of SGD? 1 million? Or should we normalize against a different metric entirely?

There are also free variables in how we define the agent’s lifetime and what type of environment we ask it to adapt to. For instance, books or tweets? Random sentences, random paragraphs, or something else? We probably need more empirical work to understand which of these makes the most sense.

Take-off Speed. The measurement thought experiment is also useful for thinking about “take-off speeds”, or the rate at which AI capabilities progress from subhuman to superhuman. Many discussions focus on whether take-off will be slow/continuous or fast/discontinuous. The outer/inner distinction shows that it’s not really about continuous vs. discontinuous, but about two continuous processes (SGD and the agent’s own adaptation) that operate on very different time scales. However, we can in principle measure both, and similar to pollution externalities I expect we would be able to see inner adaptation increasing before it became the overwhelming dynamic.

Does that mean we shouldn’t be worried about fast take-offs? I think not, because the analogy with evolution suggests that inner adaptation could run much faster than SGD, and there might be a quick transition from really boring adaptation (reptiles) to something that’s a super-big deal (humans). So, we should want ways to very sensitively measure this quantity and worry if it starts to increase exponentially, even if it starts at a low level (just like with COVID-19!).

Sets with Small Intersection

2017-03-17T00:00:00-07:00

Suppose that we want to construct subsets $S_1, \ldots, S_m \subseteq \{1,\ldots,n\}$ with the following properties:

$|S_i| \geq k$ for all $i$
$|S_i \cap S_j| \leq 1$ for all $i \neq j$

The goal is to construct as large a family of such subsets as possible (i.e., to make $m$ as large as possible). If $k \geq 2\sqrt{n}$, then up to constants it is not hard to show that the optimal number of sets is $\frac{n}{k}$ (that is, the trivial construction with all sets disjoint is essentially the best we can do).

Here I am interested in the case when $k \ll \sqrt{n}$. In this case I claim that we can substantially outperform the trivial construction: we can take $m = \Omega(n^2 / k^3)$. The proof is a very nice application of the asymmetric Lovasz Local Lemma. (Readers can refresh their memory here on what the asymmetric LLL says.)

Proof. We will take a randomized construction. For $i \in \{1,\ldots,n\}$, $a \in \{1,\ldots,m\}$, let $X_{i,a}$ be the event that $i \in S_a$. We will take the $X_{i,a}$ to be independent, each with probability $\frac{2k}{n}$. Also define the events

\[Y_{i,j,a,b} = I[i \in S_a \wedge j \in S_a \wedge i \in S_b \wedge j \in S_b]\] \[Z_{a} = I[|S_a| < k]\]

It suffices to show that with non-zero probability, all of the $Y_{i,j,a,b}$ and $Z_{a}$ are false. Note that each $Y_{i,j,a,b}$ depends on $Y_{i',j,a',b}, Y_{i',j,a,b'}, Y_{i,j',a',b}, Y_{i,j',a,b'}, Z_a, Z_b$, and each $Z_a$ depends on $Y_{i,j,a,b}$. Thus each $Y$ depends on at most $4nm$ other $Y$ and $2$ other $Z$, and each $Z$ depends on at most $n^2m/2$ of the $Y$. Also note that $P(Y_{i,j,a,b}) = (k/n)^4$ and $P(Z_a) \leq \exp(-k/4)$ (by the Chernoff bound). It thus suffices to find constants $y, z$ such that

\[(k/n)^4 \leq y(1-y)^{4nm}(1-z)^2\] \[\exp(-k/4) \leq z(1-y)^{n^2m/2}\]

We will guess $y = \frac{k}{n^2m}$, $z = \frac{1}{2}$, in which case the bottom inequality is approximately $\exp(-k/4) \leq \frac{1}{2}\exp(-k/2)$ (which is satisfied for large enough $k$), and the top inequality is approximately $\frac{k^4}{n^4} \leq \frac{k}{4n^2m} \exp(-4k/n)$, which is satisfied for $m \leq \frac{n^2}{4ek^3}$ (assuming $k \leq n/4$). Hence in particular we can indeed take $m = \Omega(n^2/k^3)$, as claimed.
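For readers who want to experiment, the two target properties are easy to check by machine. The sketch below is not the Local Lemma argument and carries no guarantee of reaching $\Omega(n^2/k^3)$ sets. It draws candidate sets with the same inclusion probability $\frac{2k}{n}$ and greedily keeps those compatible with the sets kept so far. The function names and the greedy filtering step are assumptions introduced here for illustration.

```python
import random
from itertools import combinations

def is_valid_family(sets, k):
    """Check |S_i| >= k for all i and |S_i ∩ S_j| <= 1 for all i != j."""
    if any(len(s) < k for s in sets):
        return False
    return all(len(a & b) <= 1 for a, b in combinations(sets, 2))

def sample_family(n, k, trials, seed=0):
    """Draw random candidates (each element included independently with
    probability 2k/n) and keep a candidate only if it has at least k
    elements and meets every previously kept set in at most one element."""
    rng = random.Random(seed)
    p = min(1.0, 2 * k / n)
    kept = []
    for _ in range(trials):
        cand = {i for i in range(1, n + 1) if rng.random() < p}
        if len(cand) >= k and all(len(cand & s) <= 1 for s in kept):
            kept.append(cand)
    return kept

family = sample_family(n=400, k=3, trials=200)
print(len(family), is_valid_family(family, k=3))
```

Comparing `len(family)` against $n/k$ and $n^2/k^3$ for a few values of $n$ and $k$ gives a rough feel for how far the greedy heuristic gets.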

Advice for Authors

2017-02-28T00:00:00-08:00

I’ve spent much of the last few days reading various ICML papers and I find there are a few pieces of feedback that I give consistently across several papers. I’ve collated some of these below. As a general note, many of these are about local style rather than global structure; I think that good local style probably contributes substantially more to readability than global structure and is in general under-rated. I’m in general pretty willing to break rules about global structure (such as even having a conclusion section in the first place! though this might cause reviewers to look at your paper funny), but not to break local stylistic rules without strong reasons.

General Writing Advice

Be precise. This isn’t about being pedantic, but about maximizing information content. Choose your words carefully so that you say what you mean to say. For instance, replace “performance” with “accuracy” or “speed” depending on what you mean.
Be concise. Most of us write in an overly wordy style, because it’s easy to and no one drilled it out of us. Not only does wordiness decrease readability, it wastes precious space if you have a page limit.
Avoid complex sentence structure. Most research is already difficult to understand and digest; there’s no reason to make it harder by having complex run-on sentences.
Use consistent phrasing. In general prose, we’re often told to refer to the same thing in different ways to avoid boring the reader, but in technical writing this will lead to confusion. Hopefully your actual results are interesting enough that the reader doesn’t need to be entertained by your large vocabulary.

Abstract

There’s more than one approach to writing a good abstract, and which one you take will depend on the sort of paper you’re writing. I’ll give one approach that is good for papers presenting an unusual or unfamiliar idea to readers.
The first sentence / phrase should be something that all readers will agree with. The second should be something that many readers would find surprising, or wouldn’t have thought about before; but it should follow from (or at least be supported by) the first sentence. The general idea is that you need to start by warming the reader up and putting them in the right context, before they can appreciate your brilliant insight.
Here’s an example from my Reified Context Models paper: “A classic tension exists between exact inference in a simple model and approximate inference in a complex model. The latter offers expressivity and thus accuracy, but the former provides coverage of the space, an important property for confidence estimation and learning with indirect supervision.” Note how the second sentence conveys a non-obvious claim: that coverage is important for confidence estimation as well as for indirect supervision. It’s tempting to lead with this in order to make the first sentence more punchy, but this will tend to go over readers’ heads. Imagine if the abstract had started, “In the context of inference algorithms, coverage of the space is important for confidence estimation and indirect supervision.” No one is going to understand what that means.

Introduction

The advice in this section is most applicable to the introduction section (and maybe related work and discussion), but applies on some level to other parts of the paper as well.
Many authors (myself included) end up using phrases like “much recent interest” and “increasingly important” because these phrases show up frequently in academic papers, and they are vague enough to be defensible. Even though these phrases are common, they are bad writing! They are imprecise and rely on hedge words to avoid having to explain why something is interesting or important.
Make sure to provide context before introducing a new concept; if you suddenly start talking about “NP-hardness” or “local transformations”, you need to first explain to the reader why this is something that should be considered in the present situation.
Don’t beat around the bush; if the point is “A, therefore B” (where B is some good fact about your work), then say that, rather than being humble and just pointing out A.
Don’t make the reader wait for the payoff; spell it out in the introduction. I frequently find that I have to wait until Section 4 to find out why I should care about a paper; while I might read that far, most reviewers are going to give up about halfway through Section 1. (Okay, that was a bit of an exaggeration; they’ll probably wait until the end of Section 1 before giving up.)

Conclusion / Discussion

I generally put in the conclusion everything that I wanted to put in the introduction, but couldn’t because readers wouldn’t be able to appreciate the context without reading the rest of the paper first. This is a relatively straightforward way to write a conclusion that isn’t just a re-hash of the introduction.
The conclusion can also be a good place to discuss open questions that you’d like other researchers to think about.
My model is that only the ~5 people most interested in your paper are going to actually read this section, so it’s worth somewhat tailoring to that audience. Unfortunately, the paper reviewers might also read this section, so you can’t tailor it too much or the reviewers might get upset if they end up not being in the target audience.
For theory papers, having a conclusion is completely optional (I usually skip it). In this case, open problems can go in the introduction. If you’re submitting a theory paper to NIPS or ICML, you unfortunately need a conclusion or reviewers will get upset. In my opinion, this is an instance where peer review makes the paper worse rather than better.

LaTeX

Proper citation style: one should write “Widgets are awesome (Smith, 2001).” or “Smith (2001) shows that widgets are awesome.” but never “(Smith, 2001) shows that widgets are awesome.” You can control this in LaTeX using \citep{} and \citet{} if you use natbib.
Display equations can take up a lot of space if over-used, but at the same time, too many in-line equations can make your document hard to read. Think carefully about which equations are worth displaying, and whether your in-line equations are becoming too dense.
If you leave a blank line after \end{equation} or \$\$, you will create an extra line break in the document. This is sort of annoying because white-space isn’t supposed to matter in that way, but you can save a lot of space by remembering this.
DON’T use the fullpage package. I’m used to using \usepackage{fullpage} in documents to get the margins that I want, but this will override options in many style files (including jmlr.sty which is used in machine learning).
\left( and \right) can be convenient for auto-sizing parentheses, but are often overly conservative (e.g. making parentheses too big due to serifs or subscripts). It’s fine to use \left( and \right) initially, but you might want to specify explicit sizes with \big(, \Big(, \bigg(, etc. in the final pass.
When displaying a sequence of equations (e.g. with the align environment), use \stackrel{} on any non-trivial equality or inequality statements and justify these steps immediately after the equation. See the bottom of page 6 of this paper for an example.
Make sure that \label{} commands come after the \caption{} command in a figure (rather than before), otherwise your numbering will be wrong.
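As an illustration of the \stackrel{} advice above, here is a small sketch. The derivation itself (the triangle inequality for a norm induced by an inner product) is a placeholder chosen for this example; it assumes the amsmath package for the align environment.

```latex
\begin{align}
\|x + y\|^2
  &= \|x\|^2 + 2\langle x, y\rangle + \|y\|^2 \\
  &\stackrel{(i)}{\leq} \|x\|^2 + 2\|x\|\,\|y\| + \|y\|^2 \\
  &\stackrel{(ii)}{=} \big(\|x\| + \|y\|\big)^2 .
\end{align}
Here (i) is the Cauchy-Schwarz inequality and (ii) collects terms.
```

The justification sentence sits directly after the display with no blank line in between, and the outer parentheses use an explicit \big( size rather than \left( and \right).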

Math

When using a variable that hasn’t appeared in a while, remind the reader what it is (i.e., “the sample space $\mathcal{X}$” rather than “$\mathcal{X}$”).
If it’s one of the main points of your work, call it a Theorem. If it’s a non-trivial conclusion that requires a somewhat involved argument (but it’s not a main point of the work), call it a Proposition. If the proof is short or routine, call it a Lemma, unless it follows directly from a Theorem you just stated, in which case call it a Corollary.
As a general rule there shouldn’t be more than 3 theorems in your paper (probably not more than 1). If you think this is unreasonable, consider that my COLT 2015 paper has 3 theorems across 24 pages, and my STOC 2017 paper has 2 theorems across 47 pages (not counting stating the same theorem in multiple locations).
If you just made a mathematical argument in the text that ended up with a non-trivial conclusion, you probably want to encapsulate it in a Proposition or Theorem. (Better yet, state the theorem before the argument so that the reader knows what you’re arguing for; although this isn’t always the best ordering.)

Model Mis-specification and Inverse Reinforcement Learning

2017-02-07T00:00:00-08:00

In my previous post, “Latent Variables and Model Mis-specification”, I argued that while machine learning is good at optimizing accuracy on observed signals, it has less to say about correctly inferring the values for unobserved variables in a model. In this post I’d like to focus in on a specific context for this: inverse reinforcement learning (Ng et al. 2000, Abbeel et al. 2004, Ziebart et al. 2008, Ho et al. 2016), where one observes the actions of an agent and wants to infer the preferences and beliefs that led to those actions. For this post, I am pleased to be joined by Owain Evans, who is an active researcher in this area and has co-authored an online book about building models of agents (see here in particular for a tutorial on inverse reinforcement learning and inverse planning).

Owain and I are particularly interested in inverse reinforcement learning (IRL) because it has been proposed (most notably by Stuart Russell) as a method for learning human values in the context of AI safety; among other things, this would eventually involve learning and correctly implementing human values by artificial agents that are much more powerful, and act with much broader scope, than any humans alive today. While we think that overall IRL is a promising route to consider, we believe that there are also a number of non-obvious pitfalls related to performing IRL with a mis-specified model. The role of IRL in AI safety is to infer human values, which are represented by a reward function or utility function. But crucially, human values (or human reward functions) are never directly observed.

Below, we elaborate on these issues. We hope that by being more aware of these issues, researchers working on inverse reinforcement learning can anticipate and address the resulting failure modes. In addition, we think that considering issues caused by model mis-specification in a particular concrete context can better elucidate the general issues pointed to in the previous post on model mis-specification.

Specific Pitfalls for Inverse Reinforcement Learning

In “Latent Variables and Model Mis-specification”, Jacob talked about model mis-specification, where the “true” model does not lie in the model family being considered. We encourage readers to read that post first, though we’ve also tried to make the below readable independently.

In the context of inverse reinforcement learning, one can see some specific problems that might arise due to model mis-specification. For instance, the following are things we could misunderstand about an agent, which would cause us to make incorrect inferences about the agent’s values:

The actions of the agent. If we believe that an agent is capable of taking a certain action, but in reality they are not, we might make strange inferences about their values (for instance, that they highly value not taking that action). Furthermore, if our data is e.g. videos of human behavior, we have an additional inference problem of recognizing actions from the frames.
The information available to the agent. If an agent has access to more information than we think it does, then a plan that seems irrational to us (from the perspective of a given reward function) might actually be optimal for reasons that we fail to appreciate. In the other direction, if an agent has less information than we think, then we might incorrectly believe that they don’t value some outcome A, even though they really only failed to obtain A due to lack of information.
The long-term plans of the agent. An agent might take many actions that are useful in accomplishing some long-term goal, but not necessarily over the time horizon that we observe the agent. Inferring correct values thus also requires inferring such long-term goals. In addition, long time horizons can make models more brittle, thereby exacerbating model mis-specification issues.

There are likely other sources of error as well. The general point is that, given a mis-specified model of the agent, it is easy to make incorrect inferences about an agent’s values if the optimization pressure on the learning algorithm is only towards predicting actions correctly in-sample.

In the remainder of this post, we will cover each of the above aspects – actions, information, and plans – in turn, giving both quantitative models and qualitative arguments for why model mis-specification for that aspect of the agent can lead to perverse beliefs and behavior. First, though, we will briefly review the definition of inverse reinforcement learning and introduce relevant notation.

Inverse Reinforcement Learning: Definition and Notations

In inverse reinforcement learning, we want to model an agent taking actions in a given environment. We therefore suppose that we have a state space $S$ (the set of states the agent and environment can be in), an action space $A$ (the set of actions the agent can take), and a transition function $T(s' \mid s,a)$, which gives the probability of moving from state $s$ to state $s'$ when taking action $a$. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple of $(S,A,T)$ is called an $MDP\backslash R$, which is a Markov Decision Process without a reward function. (The $MDP\backslash R$ will either have a known horizon or a discount rate $\gamma$ but we’ll leave these out for simplicity.)

Figure 1: Diagram showing how IRL and RL are related. (Credit: Pieter Abbeel’s slides on IRL)

The inference problem for IRL is to infer a reward function $R$ given an optimal policy $\pi^* : S \to A$ for the $MDP\backslash R$ (see Figure 1). We learn about the policy $\pi^*$ from samples $(s,a)$ of states and the corresponding action according to $\pi^*$ (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent’s states and actions in a single episode:

$(s_0, a_0), (s_1, a_1), \ldots, (s_n, a_n) $

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the $MDP\backslash R$ and the observed trajectory, the goal is to infer the reward function $R$. In a Bayesian framework, if we specify a prior on $R$ we have:

$P(R \mid s_{0:n},a_{0:n}) \propto P( s_{0:n},a_{0:n} \mid R) P(R) = P(R) \cdot \prod_{i=0}^n P( a_i \mid s_i, R)$

The likelihood $P(a_i \mid s_i, R)$ is just $\pi_R(s_i)[a_i]$, where $\pi_R$ is the optimal policy under the reward function $R$. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning (see Figure 1). Policies are usually assumed to be noisy (e.g. using a softmax instead of deterministically taking the best action). Due to the challenges of specifying priors, computing optimal policies and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective (see the references in the introduction for some examples).
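The Bayesian objective can be spelled out on a toy problem. The sketch below is only an illustration under stated assumptions, none of which come from the post: a three-state chain with deterministic moves, a reward that depends on the state entered, a finite set of candidate reward functions in place of a continuous prior, value iteration with discount $\gamma = 0.9$, and a softmax policy with inverse temperature `beta`. All function names are hypothetical.

```python
import math

STATES = [0, 1, 2]
ACTIONS = [-1, +1]  # move left / move right on a 3-state chain

def step(s, a):
    """Deterministic transition T: move along the chain, clipping at the ends."""
    return min(max(s + a, 0), len(STATES) - 1)

def q_values(reward, gamma=0.9, iters=200):
    """Value iteration; reward[s2] is received on entering state s2."""
    v = [0.0 for _ in STATES]
    for _ in range(iters):
        v = [max(reward[step(s, a)] + gamma * v[step(s, a)] for a in ACTIONS)
             for s in STATES]
    return {(s, a): reward[step(s, a)] + gamma * v[step(s, a)]
            for s in STATES for a in ACTIONS}

def softmax_policy(q, s, beta=2.0):
    """Noisy policy: pi_R(s)[a] is proportional to exp(beta * Q(s, a))."""
    weights = {a: math.exp(beta * q[(s, a)]) for a in ACTIONS}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def posterior(trajectory, candidates, prior):
    """P(R | trajectory) ∝ P(R) * prod_i pi_R(s_i)[a_i], over a finite set of R."""
    log_post = {}
    for name, reward in candidates.items():
        q = q_values(reward)
        log_lik = sum(math.log(softmax_policy(q, s)[a]) for s, a in trajectory)
        log_post[name] = math.log(prior[name]) + log_lik
    top = max(log_post.values())
    unnorm = {name: math.exp(lp - top) for name, lp in log_post.items()}
    z = sum(unnorm.values())
    return {name: u / z for name, u in unnorm.items()}

candidates = {"likes_left": [1, 0, 0], "likes_right": [0, 0, 1]}
prior = {"likes_left": 0.5, "likes_right": 0.5}
trajectory = [(0, +1), (1, +1), (2, +1)]  # the agent keeps moving right
post = posterior(trajectory, candidates, prior)
```

Everything that goes wrong later in the post enters through `step`, `ACTIONS`, and `softmax_policy`: if any of these mis-describes the real agent, the posterior can be confidently wrong.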

Recognizing Human Actions in Data

IRL is a promising approach to learning human values in part because of the easy availability of data. For supervised learning, humans need to produce many labeled instances specialized for a task. IRL, by contrast, is an unsupervised/semi-supervised approach where any record of human behavior is a potential data source. Facebook’s logs of user behavior provide trillions of data-points. YouTube videos, history books, and literature are a trove of data on human behavior in both actual and imagined scenarios. However, while there is lots of existing data that is informative about human preferences, we argue that exploiting this data for IRL will be a difficult, complex task with current techniques.

Inferring Reward Functions from Video Frames

As we noted above, applications of IRL typically infer the reward function $R$ from observed samples of the human policy $\pi^*$. Formally, the environment is a known $MDP\backslash R = (S,A,T)$ and the observations are state-action pairs, $(s,a) \sim \pi^*$. This assumes that (a) the environment’s dynamics $T$ are given as part of the IRL problem, and (b) the observations are structured as “state-action” pairs. When the data comes from a human expert parking a car, these assumptions are reasonable. The states and actions of the driver can be recorded and a car simulator can be used for $T$. For data from YouTube videos or history books, the assumptions fail. The data is a sequence of partial observations: the transition function $T$ is unknown and the data does not separate out state and action. Indeed, it’s a challenging ML problem to infer human actions from text or videos.

Movie still: What actions are being performed in this situation? (Source)

As a concrete example, suppose the data is a video of two co-pilots flying a plane. The successive frames provide only limited information about the state of the world at each time step and the frames often jump forward in time. So it’s more like a POMDP with a complex observation model. Moreover, the actions of each pilot need to be inferred. This is a challenging inference problem, because actions can be subtle (e.g. when a pilot nudges the controls or nods to his co-pilot).

To infer actions from observations, some model relating the true state-action $(s,a)$ to the observed video frame must be used. But choosing any model makes substantive assumptions about how human values relate to their behavior. For example, suppose someone attacks one of the pilots and (as a reflex) he defends himself by hitting back. Is this reflexive or instinctive response (hitting the attacker) an action that is informative about the pilot’s values? Philosophers and neuroscientists might investigate this by considering the mental processes that occur before the pilot hits back. If an IRL algorithm uses an off-the-shelf action classifier, it will lock in some (contentious) assumptions about these mental processes. At the same time, an IRL algorithm cannot learn such a model because it never directly observes the mental processes that relate rewards to actions.

Inferring Policies From Video Frames

When learning a reward function via IRL, the ultimate goal is to use the reward function to guide an artificial agent’s behavior (e.g. to perform useful tasks to humans). This goal can be formalized directly, without including IRL as an intermediate step. For example, in Apprenticeship Learning, the goal is to learn a “good” policy for the $MDP\backslash R$ from samples of the human’s policy $\pi^*$ (where $\pi^*$ is assumed to approximately optimize an unknown reward function). In Imitation Learning, the goal is simply to learn a policy that is similar to the human’s policy.

Like IRL, policy search techniques need to recognize an agent’s actions to infer their policy. So they have the same challenges as IRL in learning from videos or history books. Unlike IRL, policy search does not explicitly model the reward function that underlies an agent’s behavior. This leads to an additional challenge. Humans and AI systems face vastly different tasks and have different action spaces. Most actions in videos and books would never be performed by a software agent. Even when tasks are similar (e.g. humans driving in the 1930s vs. a self-driving car in 2016), it is a difficult transfer learning problem to use human policies in one task to improve AI policies in another.

IRL Needs Curated Data

We argued that records of human behaviour in books and videos are difficult for IRL algorithms to exploit. Data from Facebook seems more promising: we can store the state (e.g. the HTML or pixels displayed to the human) and each human action (clicks and scrolling). This extends beyond Facebook to any task that can be performed on a computer. While this covers a broad range of tasks, there are obvious limitations. Many people in the world have a limited ability to use a computer: we can’t learn about their values in this way. Moreover, some kinds of human preferences (e.g. preferences over physical activities) seem hard to learn about from behaviour on a computer.

Information and Biases

Human actions depend both on their preferences and their beliefs. The beliefs, like the preferences, are never directly observed. For narrow tasks (e.g. people choosing their favorite photos from a display), we can model humans as having full knowledge of the state (as in an MDP). But for most real-world tasks, humans have limited information and their information changes over time (as in a POMDP or RL problem). If IRL assumes the human has full information, then the model is mis-specified and generalizing about what the human would prefer in other scenarios can be mistaken. Here are some examples:

(1). Someone travels from their house to a cafe, which has already closed. If they are assumed to have full knowledge, then IRL would infer an alternative preference (e.g. going for a walk) rather than a preference to get a drink at the cafe.

(2). Someone takes a drug that is widely known to be ineffective. This could be because they have a false belief that the drug is effective, or because they picked up the wrong pill, or because they take the drug for its side-effects. Each possible explanation could lead to different conclusions about preferences.

(3). Suppose an IRL algorithm is inferring a person’s goals from key-presses on their laptop. The person repeatedly forgets their login passwords and has to reset them. This behavior is hard to capture with a POMDP-style model: humans forget some strings of characters and not others. IRL might infer that the person intends to repeatedly reset their passwords.

Example (3) above arises from humans forgetting information – even if the information is only a short string of characters. This is one way in which humans systematically deviate from rational Bayesian agents. The field of psychology has documented many other deviations. Below we discuss one such deviation – time-inconsistency – which has been used to explain temptation, addiction and procrastination.

Time-inconsistency and Procrastination

An IRL algorithm is inferring Alice’s preferences. In particular, the goal is to infer Alice’s preference for completing a somewhat tedious task (e.g. writing a paper) as opposed to relaxing. Alice has $T$ days in which she could complete the task and IRL observes her working or relaxing on each successive day.

Figure 2. MDP graph for choosing whether to “work” or “wait” (relax) on a task.

Formally, let $R$ be the preference/reward Alice assigns to completing the task. Each day, Alice can “work” (receiving cost $w$ for doing tedious work) or “wait” (cost $0$). If she works, she later receives the reward $R$ minus a tiny cost that increases linearly with the day on which she works, at rate $\epsilon$ (because it’s better to submit a paper earlier). Beyond the deadline at $T$, Alice cannot get the reward $R$. For IRL, we fix $\epsilon$ and $w$ and infer $R$.

Suppose Alice chooses “wait” on Day 1. If she were fully rational, it follows that $R$ (the preference for completing the task) is small compared to $w$ (the psychological cost of doing the tedious work). In other words, Alice doesn’t care much about completing the task. Rational agents will do the task on Day 1 or never do it. Yet humans often care deeply about tasks and still leave them until the last minute (when finishing early would be optimal). Here we imagine that Alice has 9 days to complete the task and waits until the last possible day.

Figure 3: Graph showing IRL inferences for Optimal model (which is mis-specified) and Possibly Discounting Model (which includes hyperbolic discounting). On each day ($x$-axis) the model gets another observation of Alice’s choice. The $y$-axis shows the posterior mean for $R$ (reward for task), where the cost of the tedious work is set to $w = -1$.

Figure 3 shows results from running IRL on this problem. There is an “Optimal” model, where the agent is optimal up to an unknown level of softmax random noise (a typical assumption for IRL). There is also a “Possibly Discounting” model, where the agent is either softmax optimal or is a hyperbolic discounter (with unknown level of discounting). We do joint Bayesian inference over the completion reward $R$, the softmax noise and (for “Possibly Discounting”) how much the agent hyperbolically discounts. The work cost $w$ is set to $-1$. Figure 3 shows that after 6 days of observing Alice procrastinate, the “Optimal” model is very confident that Alice does not care about the task $(R < |w|)$. When Alice completes the task on the last possible day, the posterior mean on $R$ is not much more than the prior mean. By contrast, the “Possibly Discounting” model never becomes confident that Alice doesn’t care about the task. (Note that the gap between the models would be bigger for larger $T$. The “Optimal” model’s posterior on $R$ shoots back to its Day-0 prior because it explains the whole action sequence as due to high softmax noise, since optimal agents without noise would either do the task immediately or not at all. Full details and code are here.)
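The forward behaviour that the two models disagree about can be simulated in a few lines. This is a sketch under assumptions, not the inference code linked above: the agent is a naive hyperbolic discounter who values anything `d` days away at `1 / (1 + k * d)` of its face value, the reward arrives `delay` days after the work, the work cost is written as a positive magnitude `cost` (the post's $w = -1$), and `eps` is the per-day lateness penalty. The parameter values and function names are hypothetical.

```python
def perceived_value(t_now, t_work, R, cost=1.0, eps=0.01, k=1.0, delay=1):
    """Value, as judged on day t_now, of doing the work on day t_work."""
    wait = t_work - t_now
    discounted_cost = cost / (1 + k * wait)
    discounted_reward = (R - eps * t_work) / (1 + k * (wait + delay))
    return discounted_reward - discounted_cost

def day_of_work(R, T=9, **kw):
    """First day on which a naive agent actually works, or None if never.
    Each day the agent picks the work day that looks best from today and
    works only if that best day is today and its value is positive."""
    for t in range(1, T + 1):
        best = max(range(t, T + 1),
                   key=lambda day: perceived_value(t, day, R, **kw))
        if best == t and perceived_value(t, t, R, **kw) > 0:
            return t
    return None

no_discounting_cares = day_of_work(R=2.5, k=0.0)     # works immediately
no_discounting_indifferent = day_of_work(R=0.5, k=0.0)  # never works
hyperbolic_cares = day_of_work(R=2.5, k=1.0)         # delays to the deadline
```

With `k=0.0` the agent either works on Day 1 or never, matching the claim about rational agents. With `k=1.0` and the same `R=2.5`, tomorrow always looks better than today until no tomorrow is left, so an observer who assumes `k=0.0` would read eight days of waiting as evidence of a small $R$.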

Long-term Plans

Agents will often take long series of actions that generate negative utility for them in the moment in order to accomplish a long-term goal (for instance, studying every night in order to perform well on a test). Such long-term plans can make IRL more difficult for a few reasons. Here we focus on two: (1) IRL systems may not have access to the right type of data for learning about long-term goals, and (2) needing to predict long sequences of actions can make algorithms more fragile in the face of model mis-specification.

(1) Wrong type of data. To make inferences based on long-term plans, it would be helpful to have coherent data about a single agent’s actions over a long period of time (so that we can e.g. see the plan unfolding). But in practice we will likely have substantially more data consisting of short snapshots of a large number of different agents (e.g. because many internet services already record user interactions, but it is uncommon for a single person to be exhaustively tracked and recorded over an extended period of time even while they are offline).

The former type of data (about a single representative population measured over time) is called panel data, while the latter type of data (about different representative populations measured at each point in time) is called repeated cross-section data. The differences between these two types of data are well-studied in econometrics, and a general theme is the following: it is difficult to infer individual-level effects from cross-sectional data.

An easy and familiar example of this difference (albeit not in an IRL setting) can be given in terms of election campaigns. Most campaign polling is cross-sectional in nature: a different population of respondents is polled at each point in time. Suppose that Hillary Clinton gives a speech and her overall support according to cross-sectional polls increases by 2%; what can we conclude from this? Does it mean that 2% of people switched from Trump to Clinton? Or did 6% of people switch from Trump to Clinton while 4% switched from Clinton to Trump?

At a minimum, then, using cross-sectional data leads to a difficult disaggregation problem; for instance, different agents taking different actions at a given point in time could be due to being at different stages in the same plan, or due to having different plans, or some combination of these and other factors. Collecting demographic and other side data can help us (by allowing us to look at variation and shifts within each subpopulation), but it is unclear if this will be sufficient in general.
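The polling example can be made concrete with a few lines of arithmetic. This is only a sketch: the function name `net_change` and the two scenarios are illustrative, using the numbers from the Clinton/Trump example. It shows that two very different individual-level stories produce the same cross-sectional signal, and that only panel-style data (total churn) tells them apart.

```python
def net_change(to_clinton, to_trump):
    """Change in Clinton's aggregate support, in percentage points, when
    `to_clinton` percent of voters switch Trump -> Clinton and
    `to_trump` percent switch Clinton -> Trump."""
    return to_clinton - to_trump

# The two individual-level stories from the polling example.
scenario_a = {"to_clinton": 2, "to_trump": 0}
scenario_b = {"to_clinton": 6, "to_trump": 4}

# A cross-sectional poll only observes the net change in support.
change_a = net_change(**scenario_a)
change_b = net_change(**scenario_b)
print(change_a, change_b)  # 2 2

# Panel data (the same respondents re-interviewed) would also reveal how
# many individuals changed their mind in total.
churn_a = scenario_a["to_clinton"] + scenario_a["to_trump"]
churn_b = scenario_b["to_clinton"] + scenario_b["to_trump"]
print(churn_a, churn_b)  # 2 10
```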

On the other hand, there are some services (such as Facebook or Google) that do have extensive data about individual users across a long period of time. However, this data has another issue: it is incomplete in a very systematic way (since it only tracks online behaviour). For instance, someone might go online most days to read course notes and Wikipedia for a class; this is data that would likely be recorded. However, it is less likely that one would have a record of that person taking the final exam, passing the class and then getting an internship based on their class performance. Of course, some pieces of this sequence would be inferable based on some people’s e-mail records, etc., but it would likely be under-represented in the data relative to the record of Wikipedia usage. In either case, some non-trivial degree of inference would be necessary to make sense of such data.

(2) Fragility to mis-specification. Above we discussed why observing only short sequences of actions from an agent can make it difficult to learn about their long-term plans (and hence to reason correctly about their values). Next we discuss another potential issue – fragility to model mis-specification.

Suppose someone spends 99 days doing a boring task to accomplish an important goal on day 100. A system that is only trying to correctly predict actions will be right 99% of the time if it predicts that the person inherently enjoys boring tasks. Of course, a system that understands the goal and how the tasks lead to the goal will be right 100% of the time, but even minor errors in its understanding could bring the accuracy back below 99%.
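To see how little slack the goal-aware model has, here is a small sketch. It assumes, purely for illustration, that the goal-aware model makes independent mistakes at a fixed per-day rate; the function names are hypothetical. The naive "enjoys boring tasks" model is wrong only on the final day, so the tolerable error rate shrinks as the plan gets longer.

```python
def naive_accuracy(horizon):
    """Accuracy of always predicting 'boring task' when the goal is
    reached only on the final day of a `horizon`-day plan."""
    return (horizon - 1) / horizon

def max_tolerable_error(horizon):
    """Largest per-day error rate a goal-aware model can have while
    still matching the naive model's accuracy (illustrative assumption:
    errors occur at a fixed rate per day)."""
    return 1 - naive_accuracy(horizon)

for horizon in (10, 100, 1000):
    print(horizon, naive_accuracy(horizon), max_tolerable_error(horizon))
```

With a 100-day plan the goal-aware model must be wrong on fewer than 1% of days to win on accuracy; with a 1000-day plan, fewer than 0.1%.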

The general issue is the following: large changes in the model of the agent might only lead to small changes in the predictive accuracy of the model, and the longer the time horizon on which a goal is realized, the more this might be the case. This means that even slight mis-specifications in the model could tip the scales back in favor of a (very) incorrect reward function. A potential way of dealing with this might be to identify “important” predictions that seem closely tied to the reward function, and focus particularly on getting those predictions right (see here for a paper exploring a similar idea in the context of approximate inference).

One might object that this is only a problem in this toy setting; for instance, in the real world, one might look at the particular way in which someone is studying or performing some other boring task to see that it coherently leads towards some goal (in a way that would be less likely were the person to be doing something boring purely for enjoyment). In other words, correctly understanding the agent’s goals might allow for more fine-grained accurate predictions which would fare better under e.g. log-score than would an incorrect model.

This is a reasonable objection, but there are some historical examples of this going wrong that should give one pause. That is, there are historical instances where: (i) people expected a more complex model that seemed to get at some underlying mechanism to outperform a simpler model that ignored that mechanism, and (ii) they were wrong (the simpler model did better under log-score). The example we are most familiar with is n-gram models vs. parse trees for language modelling; the most successful language models (in terms of having the best log-score on predicting the next word given a sequence of previous words) essentially treat language as a high-order Markov chain or hidden Markov model, despite the fact that linguistic theory predicts that language should be tree-structured rather than linearly-structured. Indeed, NLP researchers have tried building language models that assume language is tree-structured, and these models perform worse, or at least do not seem to have been adopted in practice (this is true both for older discrete models and newer continuous models based on neural nets). It’s plausible that a similar issue will occur in inverse reinforcement learning, where correctly inferring plans is not enough to win out in predictive performance. The reason for the two issues might be quite similar (in language modelling, the tree structure only wins out in statistically uncommon corner cases involving long-term and/or nested dependencies, and hence getting that part of the prediction correct doesn’t help predictive accuracy much).

The overall point is: in the case of even slight model mis-specification, the “correct” model might actually perform worse under typical metrics such as predictive accuracy. Therefore, more careful methods of constructing a model might be necessary.

Learning Values != Robustly Predicting Human Behaviour

The problems with IRL described so far will result in poor performance for predicting human choices out-of-sample. For example, if someone is observed doing boring tasks for 99 days (where they only achieve the goal on Day 100), they’ll be predicted to continue doing boring tasks even when a short-cut to the goal becomes available. So even if the goal is simply to predict human behaviour (not to infer human values), mis-specification leads to bad predictions on realistic out-of-sample scenarios.

Let’s suppose that our goal is not to predict human behaviour but to create AI systems that promote and respect human values. These goals (predicting humans and building safe AI) are distinct. Here’s an example that illustrates the difference. Consider a long-term smoker, Bob, who would continue smoking even if there were (counterfactually) a universally effective anti-smoking treatment. Maybe Bob is in denial about the health effects of smoking or Bob thinks he’ll inevitably go back to smoking whatever happens. If an AI system were assisting Bob, we might expect it to avoid promoting his smoking habit (e.g. by not offering him cigarettes at random moments). This is not paternalism, where the AI system imposes someone else’s values on Bob. The point is that even if Bob would continue smoking across many counterfactual scenarios this doesn’t mean that he places value on smoking.

How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL has a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.

Predicting human actions is neither necessary nor sufficient for learning human values. In what ways, then, are the two related? One such way stems from the premise that if someone spends more resources making a decision, the resulting decision tends to be more in keeping with their true values. For instance, someone might spend lots of time thinking about the decision, they might consult experts, or they might try out the different options in a trial period before they make the real decision. Various authors have thus suggested that people’s choices under sufficient “reflection” act as a reliable indicator of their true values. Under this view, predicting a certain kind of behaviour (choices under reflection) is sufficient for learning human values. Paul Christiano has written about some proposals for doing this, though we will not discuss them here (the first link is for general AI systems while the second is for newsfeeds). In general, turning these ideas into algorithms that are tractable and learn safely remains a challenging problem.

Acknowledgments

Thanks to Sindy Li for reviewing a full draft of this post and providing many helpful comments. Thanks also to Michael Webb and Paul Christiano for doing the same on specific sections of the post.

Linear algebra fact

2017-02-06T00:00:00-08:00

Here is an interesting linear algebra fact: let $A$ be an $n \times n$ matrix and $u$ be a vector such that $u^{\top}A = \lambda u^{\top}$. Then for any matrix $B$ such that $\lambda I - B$ is invertible, $u^{\top}((A-B)(\lambda I - B)^{-1}) = u^{\top}$.

The proof is just basic algebra: $u^{\top}(A-B)(\lambda I - B)^{-1} = (\lambda u^{\top} - u^{\top}B)(\lambda I - B)^{-1} = u^{\top}(\lambda I - B)(\lambda I - B)^{-1} = u^{\top}$.

Why care about this? Let’s imagine that $A$ is a (not necessarily symmetric) stochastic matrix, so $1^{\top}A = 1^{\top}$. Let $A-B$ be a low-rank approximation to $A$ (so $A-B$ consists of all the large singular values, and $B$ consists of all the small singular values). Unfortunately, since $A$ is not symmetric, this low-rank approximation doesn’t preserve the eigenvalues of $A$ and so we need not have $1^{\top}(A-B) = 1^{\top}$. The $(I-B)^{-1}$ factor (the case $\lambda = 1$ of the fact above) can be thought of as a “correction” term such that the resulting matrix is still low-rank, but we’ve preserved one of the eigenvectors of $A$.
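The fact is easy to check numerically. The sketch below uses exact rational arithmetic on a $2 \times 2$ example with $u = 1$ and $\lambda = 1$. Note that the matrix `B` here is an arbitrary choice made for illustration (it is not obtained from a singular value decomposition), and the helper names are hypothetical.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def sub(X, Y):
    return [[X[i][j] - Y[i][j] for j in range(2)] for i in range(2)]

def column_sums(M):
    """Computes 1^T M."""
    return [M[0][j] + M[1][j] for j in range(2)]

# Columns of A sum to 1, so 1^T A = 1^T (u = 1, lambda = 1).
A = [[F(7, 10), F(2, 10)],
     [F(3, 10), F(8, 10)]]
# Arbitrary B, chosen only so that I - B is invertible.
B = [[F(1, 10), F(-1, 5)],
     [F(1, 4), F(1, 20)]]
I = [[F(1), F(0)], [F(0), F(1)]]

uncorrected = sub(A, B)
corrected = matmul(uncorrected, inv2(sub(I, B)))

print(column_sums(uncorrected))  # not all ones in general
print(column_sums(corrected))    # all ones, as the identity predicts
```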

Prékopa–Leindler inequality

2017-02-05T00:00:00-08:00

Consider the following statements:

The shape with the largest volume enclosed by a given surface area is the $n$-dimensional sphere.
A marginal or sum of log-concave distributions is log-concave.
Any Lipschitz function of a standard $n$-dimensional Gaussian distribution concentrates around its mean.

What do these all have in common? Despite being fairly non-trivial and deep results, they all can be proved in less than half of a page using the Prékopa–Leindler inequality.

(I won’t show this here, or give formal versions of the statements above, but time permitting I will do so in a later blog post.)

Latent Variables and Model Mis-specification

2017-01-10T00:00:00-08:00

Machine learning is very good at optimizing predictions to match an observed signal — for instance, given a dataset of input images and labels of the images (e.g. dog, cat, etc.), machine learning is very good at correctly predicting the label of a new image. However, performance can quickly break down as soon as we care about criteria other than predicting observables. There are several cases where we might care about such criteria:

In scientific investigations, we often care less about predicting a specific observable phenomenon, and more about what that phenomenon implies about an underlying scientific theory.
In economic analysis, we are most interested in what policies will lead to desirable outcomes. This requires predicting what would counterfactually happen if we were to enact the policy, which we (usually) don’t have any data about.
In machine learning, we may be interested in learning value functions which match human preferences (this is especially important in complex settings where it is hard to specify a satisfactory value function by hand). However, we are unlikely to observe information about the value function directly, and instead must infer it implicitly. For instance, one might infer a value function for autonomous driving by observing the actions of an expert driver.

In all of the above scenarios, the primary object of interest – the scientific theory, the effects of a policy, and the value function, respectively – is not part of the observed data. Instead, we can think of it as an unobserved (or “latent”) variable in the model we are using to make predictions. While we might hope that a model that makes good predictions will also place correct values on unobserved variables as well, this need not be the case in general, especially if the model is mis-specified.

I am interested in latent variable inference because I think it is a potentially important sub-problem for building AI systems that behave safely and are aligned with human values. The connection is most direct for value learning, where the value function is the latent variable of interest and the fidelity with which it is learned directly impacts the well-behavedness of the system. However, one can imagine other uses as well, such as making sure that the concepts that an AI learns sufficiently match the concepts that the human designer had in mind. It will also turn out that latent variable inference is related to counterfactual reasoning, which has a large number of tie-ins with building safe AI systems that I will elaborate on in forthcoming posts.

The goal of this post is to explain why problems show up if one cares about predicting latent variables rather than observed variables, and to point to a research direction (counterfactual reasoning) that I find promising for addressing these issues. More specifically, in the remainder of this post, I will: (1) give some formal settings where we want to infer unobserved variables and explain why we can run into problems; (2) propose a possible approach to resolving these problems, based on counterfactual reasoning.

1 Identifying Parameters in Regression Problems

Suppose that we have a regression model $p_{\theta}(y \mid x)$, which outputs a probability distribution over $y$ given a value for $x$. Also suppose we are explicitly interested in identifying the “true” value of $\theta$ rather than simply making good predictions about $y$ given $x$. For instance, we might be interested in whether smoking causes cancer, and so we care not just about predicting whether a given person will get cancer ($y$) given information about that person ($x$), but specifically whether the coefficients in $\theta$ that correspond to a history of smoking are large and positive.

In a typical setting, we are given data points $(x_1,y_1), \ldots, (x_n,y_n)$ on which to fit a model. Most methods of training machine learning systems optimize predictive performance, i.e. they will output a parameter $\hat{\theta}$ that (approximately) maximizes $\sum_{i=1}^n \log p_{\theta}(y_i \mid x_i)$. For instance, for a linear regression problem with Gaussian noise we have $\log p_{\theta}(y_i \mid x_i) = -(y_i - \langle \theta, x_i \rangle)^2$, up to scaling and an additive constant. Various more sophisticated methods might employ some form of regularization to reduce overfitting, but they are still fundamentally trying to maximize some measure of predictive accuracy, at least in the limit of infinite data.

Call a model well-specified if there is some parameter $\theta^*$ for which $p_{\theta^*}(y \mid x)$ matches the true distribution over $y$, and call a model mis-specified if no such $\theta^*$ exists. One can show that for well-specified models, maximizing predictive accuracy works well (modulo a number of technical conditions). In particular, maximizing $\sum_{i=1}^n \log p_{\theta}(y_i \mid x_i)$ will (asymptotically, as $n \to \infty$) lead to recovering the parameter $\theta^*$.
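As a rough illustration of the well-specified case, the sketch below simulates a one-dimensional linear model with Gaussian noise and fits it by maximizing the squared-error log-likelihood. The value `theta_star = 2.0`, the input range, and the sample sizes are arbitrary choices made for this example. The estimate should approach $\theta^*$ as $n$ grows.

```python
import random

def fit_slope(xs, ys):
    """Maximize sum_i -(y_i - theta * x_i)^2 over a scalar theta.
    Setting the derivative to zero gives theta = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def simulate(n, theta_star=2.0, seed=0):
    """Draw n points from the well-specified model y = theta_star * x + noise
    and return the fitted slope."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [theta_star * x + rng.gauss(0, 1) for x in xs]
    return fit_slope(xs, ys)

estimates = {n: simulate(n) for n in (100, 1000, 100000)}
for n, theta_hat in estimates.items():
    print(n, theta_hat)
```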

However, if a model is mis-specified, then it is not even clear what it means to correctly infer $\theta$. We could declare the $\theta$ maximizing predictive accuracy to be the “correct” value of $\theta$, but this has issues:

While $\theta$ might do a good job of predicting $y$ in the settings we’ve seen, it may not predict $y$ well in very different settings.
If we care about determining $\theta$ for some scientific purpose, then good predictive accuracy may be an unsuitable metric. For instance, even though margarine consumption might correlate well with (and hence be a good predictor of) divorce rate, that doesn’t mean that there is a causal relationship between the two.

The two problems above also suggest a solution: we will say that we have done a good job of inferring a value for $\theta$ if $\theta$ can be used to make good predictions in a wide variety of situations, and not just the situation we happened to train the model on. (For the latter case of predicting causal relationships, the “wide variety of situations” should include the situation in which the relevant causal intervention is applied.)

Note that both of the problems above are different from the typical statistical problem of overfitting. Classically, overfitting occurs when a model is too complex relative to the amount of data at hand, but even if we have a large amount of data, the problems above could occur. This is illustrated in the following graph:

Here the blue line is the data we have ($x,y$), and the green line is the model we fit (with slope and intercept parametrized by $\theta$). We have more than enough data to fit a line to it. However, because the true relationship is quadratic, the best linear fit depends heavily on the distribution of the training data. If we had fit to a different part of the quadratic, we would have gotten a potentially very different result. Indeed, in this situation, there is no linear relationship that can do a good job of extrapolating to new situations, unless the domain of those new situations is restricted to the part of the quadratic that we’ve already seen.

I will refer to the type of error in the diagram above as mis-specification error. Again, mis-specification error is different from error due to overfitting. Overfitting occurs when there is too little data and noise is driving the estimate of the model; in contrast, mis-specification error can occur even if there is plenty of data, and instead occurs because the best-performing model is different in different scenarios.
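Mis-specification error can be reproduced in a few lines. The sketch below assumes the true relationship is exactly $y = x^2$ with no noise, and fits a line by ordinary least squares on two different input ranges (both chosen arbitrarily for illustration). There is plenty of data in each case, yet the fitted slopes disagree and each line extrapolates poorly outside its training range.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def grid(lo, hi, n=1001):
    """Evenly spaced inputs on [lo, hi]."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def fit_on_range(lo, hi):
    xs = grid(lo, hi)
    return fit_line(xs, [x * x for x in xs])

def mse(slope, intercept, lo, hi):
    """Mean squared error of the line against y = x^2 on [lo, hi]."""
    xs = grid(lo, hi)
    return sum((slope * x + intercept - x * x) ** 2 for x in xs) / len(xs)

slope_low, icpt_low = fit_on_range(0.0, 1.0)
slope_high, icpt_high = fit_on_range(2.0, 3.0)
print(slope_low, slope_high)  # close to 1 and 5 respectively

# The line fit on [0, 1] does well there but badly on [2, 3].
print(mse(slope_low, icpt_low, 0.0, 1.0))
print(mse(slope_low, icpt_low, 2.0, 3.0))
```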

2 Structural Equation Models

We will next consider a slightly subtler setting, which in economics is referred to as a structural equation model. In this setting we again have an output $y$ whose distribution depends on an input $x$, but now this relationship is mediated by an unobserved variable $z$. A common example is a discrete choice model, where consumers make a choice among multiple goods ($y$) based on a consumer-specific utility function ($z$) that is influenced by demographic and other information about the consumer ($x$). Natural language processing provides another source of examples: in semantic parsing, we have an input utterance ($x$) and output denotation ($y$), mediated by a latent logical form $z$; in machine translation, we have input and output sentences ($x$ and $y$) mediated by a latent alignment ($z$).

Symbolically, we represent a structural equation model as a parametrized probability distribution $p_{\theta}(y, z \mid x)$, where we are trying to fit the parameters $\theta$. Of course, we can always turn a structural equation model into a regression model by using the identity $p_{\theta}(y \mid x) = \sum_{z} p_{\theta}(y, z \mid x)$, which allows us to ignore $z$ altogether. In economics this is called a reduced form model. We use structural equation models if we are specifically interested in the unobserved variable $z$ (for instance, in the examples above we are interested in the utility function of each consumer, or in the logical form representing the sentence’s meaning).

In the regression setting where we cared about identifying $\theta$, it was obvious that there was no meaningful “true” value of $\theta$ when the model was mis-specified. In this structural equation setting, we now care about the latent variable $z$, which can take on a meaningful true value (e.g. the actual utility function of a given individual) even if the overall model $p_{\theta}(y,z \mid x)$ is mis-specified. It is therefore tempting to think that if we fit parameters $\theta$ and use them to impute $z$, we will have meaningful information about the actual utility functions of individual consumers. However, this is a notational sleight of hand: just because we call $z$ “the utility function” does not make it so. The variable $z$ need not correspond to the actual utility function of the consumer, nor do the consumer’s preferences even need to be representable by a utility function.

We can understand what goes wrong by considering the following procedure, which formalizes the proposal above:

Find $\theta$ to maximize the predictive accuracy on the observed data, $\sum_{i=1}^n \log p_{\theta}(y_i \mid x_i)$, where $p_{\theta}(y_i \mid x_i) = \sum_z p_{\theta}(y_i, z \mid x_i)$. Call the result $\theta_0$.
Using this value $\theta_0$, treat $z_i$ as being distributed according to $p_{\theta_0}(z \mid x_i,y_i)$. On a new value $x_+$ for which $y$ is not observed, treat $z_+$ as being distributed according to $p_{\theta_0}(z \mid x_+)$.
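The imputation step of this procedure can be sketched for a toy model with binary $y$ and $z$. Everything about the model below is a hypothetical choice made for illustration: $z \mid x$ is Bernoulli with probability `sigmoid(w * x)`, $y \mid z$ is Bernoulli with probability `q[z]`, and `theta_0` stands in for parameters that have already been fit. The code computes the marginal $p_{\theta}(y \mid x)$ by summing out $z$, the posterior $p_{\theta_0}(z \mid x_i, y_i)$ for an observed pair, and the distribution $p_{\theta_0}(z \mid x_+)$ for a new input.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def joint(theta, x):
    """Return p_theta(y, z | x) as a dict keyed by (y, z)."""
    w, q = theta
    p_z1 = sigmoid(w * x)
    table = {}
    for z, p_z in ((0, 1 - p_z1), (1, p_z1)):
        for y in (0, 1):
            p_y = q[z] if y == 1 else 1 - q[z]
            table[(y, z)] = p_z * p_y
    return table

def marginal_y(theta, x, y):
    """p_theta(y | x) = sum_z p_theta(y, z | x)."""
    table = joint(theta, x)
    return table[(y, 0)] + table[(y, 1)]

def posterior_z(theta, x, y):
    """p_theta(z | x, y), used when y was observed."""
    table = joint(theta, x)
    norm = marginal_y(theta, x, y)
    return {z: table[(y, z)] / norm for z in (0, 1)}

def prior_z(theta, x):
    """p_theta(z | x), used for a new input where y is not observed."""
    table = joint(theta, x)
    return {z: table[(0, z)] + table[(1, z)] for z in (0, 1)}

theta_0 = (1.5, {0: 0.2, 1: 0.9})  # placeholder for fitted parameters
post = posterior_z(theta_0, x=0.0, y=1)
prior = prior_z(theta_0, x=0.0)
print(post, prior)
```

Nothing in this computation checks whether $z$ corresponds to anything real; it only redistributes probability mass according to the assumed model.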

As before, if the model is well-specified, one can show that such a procedure asymptotically outputs the correct probability distribution over $z$. However, if the model is mis-specified, things can quickly go wrong. For example, suppose that $y$ represents what choice of drink a consumer buys, and $z$ represents consumer utility (which might be a function of the price, attributes, and quantity of the drink). Now suppose that individuals have preferences which are influenced by unmodeled covariates: for instance, a preference for cold drinks on warm days, while the input $x$ does not have information about the outside temperature when the drink was bought. This could cause any of several effects:

If there is a covariate that happens to correlate with temperature in the data, then we might conclude that that covariate is predictive of preferring cold drinks.
We might increase our uncertainty about $z$ to capture the unmodeled variation in $y$.
We might implicitly increase uncertainty by moving utilities closer together (allowing noise or other factors to more easily change the consumer’s decision).

In practice we will likely have some mixture of all of these, and this will lead to systematic biases in our conclusions about the consumers’ utility functions.

The same problems as before arise: while we by design place probability mass on values of $z$ that correctly predict the observation $y$, under model mis-specification this could be due to spurious correlations or other perversities of the model. Furthermore, even though predictive performance is high on the observed data (and data similar to the observed data), there is no reason for this to continue to be the case in settings very different from the observed data, which is particularly problematic if one is considering the effects of an intervention. For instance, while inferring preferences between hot and cold drinks might seem like a silly example, the design of timber auctions constitutes a much more important example with a roughly similar flavour, where it is important to correctly understand the utility functions of bidders in order to predict their behaviour under alternative auction designs (the model is also more complex, allowing even more opportunities for mis-specification to cause problems).

3 A Possible Solution: Counterfactual Reasoning

In general, under model mis-specification we have the following problems:

It is often no longer meaningful to talk about the “true” value of a latent variable $\theta$ (or at the very least, not one within the specified model family).
Even when there is a latent variable $z$ with a well-defined meaning, the imputed distribution over $z$ need not match reality.

We can make sense of both of these problems by thinking in terms of counterfactual reasoning. Without defining it too formally, counterfactual reasoning is the problem of making good predictions not just in the actual world, but in a wide variety of counterfactual worlds that “could” exist. (I recommend this paper as a good overview for machine learning researchers.)

While typically machine learning models are optimized to predict well on a specific distribution, systems capable of counterfactual reasoning must make good predictions on many distributions (essentially any distribution that can be captured by a reasonable counterfactual). This stronger guarantee allows us to resolve many of the issues discussed above, while still thinking in terms of predictive performance, which historically seems to have been a successful paradigm for machine learning. In particular:

While we can no longer talk about the “true” value of $\theta$, we can say that a value of $\theta$ is a “good” value if it makes good predictions on not just a single test distribution, but many different counterfactual test distributions. This allows us to have more confidence in the generalizability of any inferences we draw based on $\theta$ (for instance, if $\theta$ is the coefficient vector for a regression problem, any variable with positive sign is likely to robustly correlate with the response variable for a wide variety of settings).
The imputed distribution over a variable $z$ must also lead to good predictions for a wide variety of distributions. While this does not force $z$ to match reality, it is a much stronger condition and does at least mean that any aspect of $z$ that can be measured in some counterfactual world must correspond to reality. (For instance, any aspect of a utility function that could at least counterfactually result in a specific action would need to match reality.)
We will successfully predict the effects of an intervention, as long as that intervention leads to one of the counterfactual distributions considered.

(Note that it is less clear how to actually train models to optimize counterfactual performance, since we typically won’t observe the counterfactuals! But it does at least define an end goal with good properties.)
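One way to make this end goal concrete is to score a model by its worst loss over a family of test distributions rather than its loss on a single one. The sketch below is only one possible formalization, not a standard definition: the worst-case criterion, the choice of $y = x^2$ as ground truth, and the three input ranges are all assumptions made for illustration.

```python
def mse(predict, truth, xs):
    """Mean squared error of `predict` against `truth` on inputs xs."""
    return sum((predict(x) - truth(x)) ** 2 for x in xs) / len(xs)

def grid(lo, hi, n=101):
    """Evenly spaced inputs on [lo, hi], standing in for a distribution."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def worst_case_loss(predict, truth, distributions):
    """Largest loss over a collection of test distributions."""
    return max(mse(predict, truth, xs) for xs in distributions)

def truth(x):
    return x * x

def linear(x):
    # Roughly the best-fitting line for y = x^2 on [0, 1].
    return x - 1 / 6

def quadratic(x):
    return x * x

observed = [grid(0, 1)]
counterfactual = [grid(0, 1), grid(1, 2), grid(2, 3)]

print(worst_case_loss(linear, truth, observed))           # small
print(worst_case_loss(linear, truth, counterfactual))     # large
print(worst_case_loss(quadratic, truth, counterfactual))  # zero
```

The linear model looks fine on the observed distribution and only the wider family of distributions exposes it, which is the sense in which counterfactual performance is a stronger requirement.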

Many people have a strong association between the concepts of “counterfactual reasoning” and “causal reasoning”. It is important to note that these are distinct ideas; causal reasoning is a type of counterfactual reasoning (where the counterfactuals are often thought of as centered around interventions), but I think of counterfactual reasoning as any type of reasoning that involves making robustly correct statistical inferences across a wide variety of distributions. On the other hand, some people take robust statistical correlation to be the definition of a causal relationship, and thus do consider causal and counterfactual reasoning to be the same thing.

I think that building machine learning systems that can do a good job of counterfactual reasoning is likely to be an important challenge, especially in cases where reliability and safety are important, and necessitates changes in how we evaluate machine learning models. In my mind, while the Turing test has many flaws, one thing it gets very right is the ability to evaluate the accuracy of counterfactual predictions (since dialogue provides the opportunity to set up counterfactual worlds via shared hypotheticals). In contrast, most existing tasks focus on repeatedly making the same type of prediction with respect to a fixed test distribution. This latter type of benchmarking is of course easier and more clear-cut, but fails to probe important aspects of our models. I think it would be very exciting to design good benchmarks that require systems to do counterfactual reasoning, and I would even be happy to incentivize such work monetarily.

Acknowledgements

Thanks to Michael Webb, Sindy Li, and Holden Karnofsky for providing feedback on drafts of this post. If any readers have additional feedback, please feel free to send it my way.

Individual Project Fund: Further Details

2016-12-31T00:00:00-08:00

In my post on where I plan to donate in 2016, I said that I would set aside \$2000 for funding promising projects that I come across in the next year:

The idea behind the project fund is … [to] give in a low-friction way on scales that are too small for organizations like Open Phil to think about. Moreover, it is likely good for me to develop a habit of evaluating projects I come across and thinking about whether they could benefit from additional money (either because they are funding constrained, or to incentivize an individual who is on the fence about carrying the project out). Finally, if this effort is successful, it is possible that other EAs will start to do this as well, which could magnify the overall impact. I think there is some danger that I will not be able to allocate the \$2000 in the next year, in which case any leftover funds will go to next year’s donor lottery.

In this post I will give some further details about this fund. My primary goal is to give others an idea of what projects I am likely to consider funding, so that anyone who thinks they might be a good fit for this can get in contact with me. (I also expect many of the best opportunities to come from people whom I meet in person but who don’t necessarily read this blog, so I plan to actively look for projects throughout the year as well.)

I am looking to fund or incentivize projects that meet several of the criteria below:

The project is in the area of computer science, especially one of machine learning, cyber security, algorithmic game theory, or computational social choice. [Some other areas that I would be somewhat likely to consider, in order of plausibility: economics, statistics, political science (especially international security), and biology.]
The project either wouldn’t happen, or would seem less worthwhile / higher-effort without the funding.
The organizer is someone who either I or someone I trust has an exceptionally high opinion of.
The project addresses a topic that I personally think is highly important. High-level areas that I tend to care about include international security, existential risk, AI safety, improving political institutions, improving scientific institutions, and helping the global poor. Technical areas that I tend to care about include reliable machine learning, machine learning and security, counterfactual reasoning, and value learning. On the other hand, if you have a project that you feel has a strong case for importance but doesn’t fit into these areas, I am interested in hearing about it.
It is unlikely that this project or a substantially similar project would be done by someone else at a similar level of quality. (Or, whoever else is likely to do it would instead focus on a similarly high-value project, if this one were to be taken care of.)
The topic pertains to a technical area that I or someone I trust has a high degree of expertise in, and can evaluate more quickly and accurately than a non-specialized funder.

It isn’t necessary to meet all of the criteria above, but I would probably want most things I fund to meet at least 4 of these 6.

Here are some concrete examples of things I might fund:

Someone is thinking of doing a project that is undervalued (in terms of career benefits) but would be very useful. They don’t feel excited about allocating time to a non-career-relevant task but would feel more excited if they received an award of \$1000 for their efforts.
Someone I trust is starting a new discussion group in an area that I think is important, but can’t find anyone to sponsor it, and wants money for providing food at the meetings.
Someone wants to do an experiment that I find valuable, but needs more compute resources than they have, and could use money for buying AWS hours.
Someone wants to curate a valuable dataset and needs money for hiring mechanical turkers.
Someone is organizing a workshop and needs money for securing a venue.
One project I am particularly interested in is a good survey paper at the intersection of machine learning and cyber security. If you might be interested in doing this, I would likely be willing to pay you.
There are likely many projects in the area of political activism that I would be interested in funding, although (due to crowdedness concerns) I have a particularly high bar for this area in terms of the criteria I laid out above.

If you think you might have a project that could use funding, please get in touch with me at jacob.steinhardt@gmail.com. Even if you are not sure if your project would be a good target for funding, I am very happy to talk to you about it. In addition, please feel free to comment either here or via e-mail if you have feedback on this general idea, or thoughts on types of small-project funding that I missed above.

Donations for 2016

2016-12-28T00:00:00-08:00

The following explains where I plan to donate in 2016, with some of my thinking behind it. This year, I had \$10,000 to allocate (the sum of my giving from 2015 and 2016, which I lumped together for tax reasons; although I think this was a mistake in retrospect, both due to discount rates and because I could have donated in January and December 2016 and still received the same tax benefits).

To start with the punch line: I plan to give \$4000 to the EA donor lottery, \$2500 to GiveWell for discretionary granting, \$2000 to be held in reserve to fund promising projects, \$500 to GiveDirectly, \$500 to the Carnegie Endowment (earmarked for the Carnegie-Tsinghua Center), and \$500 to the Blue Ribbon Study Panel.

For those interested in donating to any of these: instructions for the EA donor lottery and the Blue Ribbon Study Panel are in the corresponding links above, and you can donate to both GiveWell and GiveDirectly at this page. I am looking into whether it is possible for small donors to give to the Carnegie Endowment, and will update this page when I find out.

At a high level, I partitioned my giving into two categories, which are roughly (A) “help poor people right now” and (B) “improve the overall trajectory of civilization” (these are meant to be rough delineations rather than rigorous definitions). I decided to split my giving into 30% category A and 70% category B. This is because while I believe that category B is the more pressing and impactful category to address in some overall utilitarian sense, I still feel a particular moral obligation towards helping the existing poor in the world we currently live in, which I don’t feel can be discharged simply by giving more to category B. The 30-70 split is meant to represent the fact that while category B seems more important to me, category A still receives substantial weight in my moral calculus (which isn’t fully utilitarian or even consequentialist).

The rest of this post treats categories A and B each in turn.

Category A: The Global Poor

Out of \$3000 in total, I decided to give \$2500 to GiveWell for discretionary regranting (which will likely be disbursed roughly but not exactly according to GiveWell’s recommended allocation), and \$500 to some other source, with the only stipulation being that it did not exactly match GiveWell’s recommendation. The reason for this was the following: while I expect GiveWell’s recommendation to outperform any conclusion that I personally reach, I think there is substantial value in the exercise of personally thinking through where to direct my giving. A few more specific reasons:

Most importantly, while I think that offloading giving decisions to a trusted expert is the correct decision to maximize the impact of any individual donation, collectively it leads to a bad equilibrium where substantially less, and less diverse, brainpower is devoted to thinking about where to give. I think that giving a small but meaningful amount based on one’s own reasoning largely ameliorates this effect without losing much direct value.
In addition, I think it is good to build the skills to think through, in principle, where to direct resources, even if in practice most of the work is outsourced to a dedicated organization.
Finally, having a large number of individual donors check GiveWell’s work and search for alternatives creates stronger incentives for GiveWell to do a thorough job (and allows donors to have more confidence that GiveWell is doing a thorough job). While I know many GiveWell staff and believe that they would do an excellent job independently of external vetting, I still think this is good practice.

Related to the last point: doing this exercise gave me a better appreciation for the overall reliability, strengths, and limitations of GiveWell’s work. In general, I found that GiveWell’s work was incredibly thorough (more so than I expected despite my high opinion of them), and moreover that they have moved substantial money beyond the publicized annual donor recommendations. An example of this is their 2016 grant to IDinsight. IDinsight ended up being one of my top candidates for where to donate, such that I thought it was plausibly even better than a GiveWell top charity. However, when I looked into it further it turned out that GiveWell had already essentially filled their entire funding gap.

I think this anecdote serves to illustrate a few things: first, as noted, GiveWell is very thorough, and does substantial work beyond what is apparent from the top charities page. Second, while GiveWell had already given to IDinsight, the grant was made in 2016. I think the same process I used would not have discovered IDinsight in 2015, but it’s possible that other processes would have. So, I think it is possible that a motivated individual could identify strong giving opportunities a year ahead of GiveWell. As a point against this, I think I am in an unusually good position to do this and still did not succeed. I also think that even if an individual identified a strong opportunity, it is unlikely that they could be confident that it was strong, and in most cases GiveWell’s top charities would still be better bets in expectation (but I think that merely identifying a plausibly strong giving opportunity should count as a huge success for the purposes of the overall exercise).

To elaborate on why my positioning might be atypically good: I already know GiveWell staff and so have some appreciation for their thinking, and I work at Stanford and have several friends in the economics department, which is one of the strongest departments in the world for Development Economics. In particular, I discussed my giving decisions extensively with a student of Pascaline Dupas, who is one of the world experts in the areas of economics most relevant to GiveWell’s recommendations.

Below are specifics on organizations I looked into and where I ultimately decided to give.

Object-level Process and Decisions (Category A)

My process for deciding where to give mostly consisted of talking to several people I trust, brainstorming and thinking things through myself, and a small amount of online research. (I think that I should likely have done substantially more online research than I ended up doing, but my thinking style tends to benefit from 1-on-1 discussions, which I also find more enjoyable.) The main types of charities that I ended up considering were:

GiveDirectly (direct cash transfers)
IPA/JPAL and similar groups (organizations that support academic research on international development)
IDinsight and similar groups (similar to the previous group, but explicitly tries to do the “translational work” of going from academic research to evidence-backed large-scale interventions)
public information campaigns (such as Development Media International)
animal welfare
start-ups or other small groups in the development space that might need seed funding
meta-charities such as CEA that try to increase the amount of money moved to EA causes (or evidence-backed charity more generally)

I ultimately felt unsure whether animal welfare should count in this category, and while I felt that CEA was a potentially strong candidate in terms of pure cost-effectiveness, directing funds there felt overly insular/meta to me in a way that defeated the purpose of the giving exercise. (Note: two individuals who reviewed this post encouraged me to revisit this point; as a result, next year I plan to look into CEA in more detail.)

While looking into the “translational work” category, I came across one organization other than IDinsight that did work in this area and was well-regarded by at least some economists. While I was less impressed by them than I was by IDinsight, they seemed plausibly strong, and it turned out that GiveWell had not yet evaluated them. While I ended up deciding not to give to them (based on feeling that IDinsight was likely to do substantially better work in the same area) I did send GiveWell an e-mail bringing the organization to their attention.

When looking into IPA, my impression was that while they have been responsible for some really good work in the past, this was primarily while they were a smaller organization, and they have now become large and bureaucratic enough that their future value will be substantially lower. However, I also found out about an individual who was running a small organization in the same space as IPA, and seemed to be doing very good work. While I was unable to offer them money for reasons related to conflict of interest, I do plan to try to find ways to direct funds to them if they are interested.

While public information campaigns seem like they could a priori be very effective, briefly looking over GiveWell’s page on DMI gave me the impression that GiveWell had already considered this area in a great deal of depth and prioritized other interventions for good reasons.

I ultimately decided to give my money to GiveDirectly. While in some sense this violates the spirit of the exercise, I felt satisfied about having found at least one potentially good giving opportunity (the small IPA-like organization) even if I was unable to give to it personally, and overall felt that I had done a reasonable amount of research. Moreover, I have a strong intuition that 0% is the wrong allocation for GiveDirectly, and it wasn’t clear to me that GiveWell’s reasons for recommending 0% were strong enough to override that intuition.

So, overall, \$2500 of my donation will go to GiveWell for discretionary re-granting, and \$500 to GiveDirectly.

Trajectory of Civilization (Category B)

First, I plan to put \$2000 into escrow for the purpose of supporting any useful small projects (specifically in the field of computer science / machine learning) that I come across in the next year. For the remaining \$5000, I plan to allocate \$4000 of it to the donor lottery, \$500 to the Carnegie Endowment, and \$500 to the Blue Ribbon Study Panel on Biodefense. For the latter two, I wanted to donate to something that improved medium-term international security, because I believe that this is an important area that is relatively under-invested in by the effective altruist community (both in terms of money and cognitive effort). Here are all of the major possibilities that I considered:

Donating to the Future of Humanity Institute, with funds earmarked towards their collaboration with Allan Dafoe. I decided against this because my impression was that this particular project was not funding-constrained. (However, I am very excited by the work that Allan and his collaborators are doing, and would like to find ways to meaningfully support it.)
Donating to the Carnegie Endowment, restricted specifically to the Carnegie-Tsinghua Center. My understanding is that this is one of the few western organizations working to influence China’s nuclear policy (though this is based on personal conversation and not something I have looked into myself). My intuition is that influencing Chinese nuclear policy is substantially more tractable than U.S. nuclear policy, due to far fewer people trying to do so. In addition, from looking at their website, I felt that most of the areas they worked in were important areas, which I believe to be unusual for large organizations with multiple focuses (as a contrast, for other organizations with a similar number of focus areas, I felt that roughly half of the areas were obviously orders of magnitude less important than the areas I was most excited about). I had some reservations about donating (due to their size: \$30 million in revenue per year, and \$300 million in assets), but I decided to donate \$500 anyways because I am excited about this general type of work. (This organization was brought to my attention by Nick Beckstead; Nick notes that he doesn’t have strong opinions about this organization, primarily due to not knowing much about them.)
Donating to the Blue Ribbon Study Panel: I am basically trusting Jaime Yassif that this is a strong recommendation within the area of biodefense.
Donating to the ACLU: The idea here would be to decrease the probability that a President Trump seriously erodes democratic norms within the U.S. However, I currently expect the ACLU to be well-funded (my understanding is that they got a flood of donations after Trump was elected).
Donating to the DNC or the Obama/Holder redistricting campaign: This is based on the idea that (1) Democrats are much better than Republicans for global stability / good U.S. policy, and (2) Republicans should be punished for helping Trump to become president. I basically agree with both, and could see myself donating to the redistricting campaign in particular in the future, but this intuitively feels less tractable/underfunded than non-partisan efforts like the Carnegie Endowment or Blue Ribbon Study Panel.
Creating a prize fund for incentivizing important research projects within computer science: I was originally planning to allocate \$1000 to \$2000 to this, based on the idea that computer science is a key field for multiple important areas (both AI safety and cyber security) and that as an expert in this field I would be in a unique position to identify useful projects relative to others in the EA community. However, after talking to several people and thinking about it myself, I decided that it was likely not tractable to provide meaningful incentives via prizes at such a small scale, and opted to instead set aside \$2000 to support promising projects as I come across them.

(As a side note: it isn’t completely clear to me whether the Carnegie Endowment accepts small donations. I plan to contact them about this, and if they do not, allocate the money to the Blue Ribbon Study Panel instead.)

In the remainder of this post I will briefly describe the \$2000 project fund, how I plan to use it, and why I decided it was a strong giving opportunity. I also plan to describe this in more detail in a separate follow-up post. Credit goes to Owen Cotton-Barratt for suggesting this idea. In addition, one of Paul Christiano’s blog posts inspired me to think about using prizes to incentivize research, and Holden Karnofsky further encouraged me to think along these lines.

The idea behind the project fund is similar to the idea behind the prize fund: I understand research in computer science better than most other EAs, and can give in a low-friction way on scales that are too small for organizations like Open Phil to think about. Moreover, it is likely good for me to develop a habit of evaluating projects I come across and thinking about whether they could benefit from additional money (either because they are funding constrained, or to incentivize an individual who is on the fence about carrying the project out). Finally, if this effort is successful, it is possible that other EAs will start to do this as well, which could magnify the overall impact. I think there is some danger that I will not be able to allocate the \$2000 in the next year, in which case any leftover funds will go to next year’s donor lottery.

Thinking Outside One’s Paradigm

2016-12-26T00:00:00-08:00

When I meet someone who works in a field outside of computer science, I usually ask them a lot of questions about their field that I’m curious about. (This is still relevant even if I’ve already met someone in that field before, because it gives me an idea of the range of expert consensus; for some questions this ends up being surprisingly variable.) I often find that, as an outsider, I can think of natural-seeming questions that experts in the field haven’t thought about, because their thinking is confined by their field’s paradigm while mine is not (pessimistically, it’s instead constrained by a different paradigm, i.e. computer science).

Usually my questions are pretty naive, and are basically what a computer scientist would think to ask based on their own biases. For instance:

Neuroscience: How much computation would it take to simulate a brain? Do our current theories of how neurons work allow us to do that even in principle?
Political science: How does the rise of powerful multinational corporations affect theories of international security (typical past theories assume that the only major powers are states)? How do we keep software companies (like Google, etc.) politically accountable? How will cyber attacks / cyber warfare affect international security?
Materials science: How much of the materials design / discovery process can be automated? What are the bottlenecks to building whatever materials we would like to? How can different research groups effectively communicate and streamline their steps for synthesizing materials?

When I do this, it’s not unusual for me to end up asking questions that the other person hasn’t really thought about before. In this case, responses range from “that’s not a question that our field studies” to “I haven’t thought about this much, but let’s try to think it through on the spot”. Of course, sometimes the other person has thought about it, and sometimes my question really is just silly or ill-formed for some reason (I suspect this is true more often than I’m explicitly made aware of, since some people are too polite to point it out to me).

I find the cases where the other person hasn’t thought about the question to be striking, because it means that I as a naive outsider can ask natural-seeming questions that haven’t been considered before by an expert in the field. I think what is going on here is that I and my interlocutor are using different paradigms (in the Kuhnian sense) for determining what questions are worth asking in a field. But while there is a sense in which the other person’s paradigm is more trustworthy – since it arose from a consensus of experts in the relevant field – that doesn’t mean that it’s absolutely reliable. Paradigms tend to blind one to evidence or problems that don’t fit into that paradigm, and paradigm shifts in science aren’t really that rare. (In addition, many fields including machine learning don’t even have a single agreed-upon paradigm.)

I think that as a scientist (or really, even as a citizen) it is important to be able to see outside one’s own paradigm. I currently think that I do a good job of this, but it seems to me that there’s a big danger of becoming more entrenched as I get older. Based on the above experiences, I plan to use the following test: When someone asks me a question about my field, how often have I not thought about it before? How tempted am I to say, “That question isn’t interesting”? If these start to become more common, then I’ll know something has gone wrong.

A few miscellaneous observations:

There are several people I know who routinely have answers to whatever questions I ask. Interestingly, they tend to be considered slightly “crackpot-ish” within their field; and they might also be less successful by conventional metrics, relative to how smart they are considered by their colleagues. I think this is a result of the fact that most academic fields over-reward progress within that field’s paradigm and under-reward progress outside of it.
Beyond “slightly crackpot-ish academics”, the other set of people who routinely have answers to my questions are philosophers and some people in program manager roles (this includes certain types of VCs as well).
I would guess that, in general, technical fields that overlap with the humanities are more likely to take a broad view and not get stuck in a single paradigm. For instance, I would expect political scientists to have thought about most of the political science questions I mentioned above; however, I haven’t talked to enough political scientists (or social scientists in general) to have much confidence in this.

Two Strange Facts

2016-08-25T00:00:00-07:00

Here are two strange facts about matrices, which I can prove but not in a satisfying way.

If $A$ and $B$ are symmetric matrices satisfying $0 \preceq A \preceq B$, then $A^{1/2} \preceq B^{1/2}$, and $B^{-1} \preceq A^{-1}$, but it is NOT necessarily the case that $A^2 \preceq B^2$. Is there a nice way to see why the first two properties should hold but not necessarily the third? In general, do we have $A^p \preceq B^p$ if $p \in [0,1]$?
Given a rectangular matrix $W \in \mathbb{R}^{n \times d}$, and a set $S \subseteq [n]$, let $W_S$ be the submatrix of $W$ with rows in $S$, and let $\|W_S\|_*$ denote the nuclear norm (sum of singular values) of $W_S$. Then the function $f(S) = \|W_S\|_*$ is submodular, meaning that $f(S \cup T) + f(S \cap T) \leq f(S) + f(T)$ for all sets $S, T$. In fact, this is true if we take $f_p(S)$, defined as the sum of the $p$th powers of the singular values of $W_S$, for any $p \in [0,2]$. The only proof I know involves trigonometric integrals and seems completely unmotivated to me. Is there any clean way of seeing why this should be true?
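As a concrete illustration of the first fact, here is a minimal sketch in plain Python that checks all three orderings on one pair of $2 \times 2$ matrices. The matrices `A` and `B`, the helper names, and the closed-form $2 \times 2$ square root are choices made for this example and are not from the discussion above; a single example shows that $A^2 \preceq B^2$ can fail, but of course does not prove the two positive statements.

```python
import math

def sqrt_psd_2x2(m):
    """Principal square root of a nonzero symmetric PSD 2x2 matrix.

    Closed form: sqrt(M) = (M + sqrt(det M) I) / sqrt(tr M + 2 sqrt(det M)).
    """
    a, b, d = m[0][0], m[0][1], m[1][1]
    s = math.sqrt(max(a * d - b * b, 0.0))  # sqrt(det M)
    t = math.sqrt(a + d + 2.0 * s)          # trace of sqrt(M)
    return [[(a + s) / t, b / t], [b / t, (d + s) / t]]

def inverse_2x2(m):
    a, b, c, d = m[0][0], m[0][1], m[1][0], m[1][1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def square_2x2(m):
    return [[sum(m[i][k] * m[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def minus(x, y):
    return [[x[i][j] - y[i][j] for j in range(2)] for i in range(2)]

def is_psd_2x2(m, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its trace and determinant are >= 0."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    return a + d >= -tol and a * d - b * b >= -tol

# Hypothetical example pair: B - A = diag(1, 0), so 0 <= A <= B.
A = [[2.0, 1.0], [1.0, 2.0]]
B = [[3.0, 1.0], [1.0, 2.0]]

checks = {
    "A <= B": is_psd_2x2(minus(B, A)),
    "sqrt(A) <= sqrt(B)": is_psd_2x2(minus(sqrt_psd_2x2(B), sqrt_psd_2x2(A))),
    "inv(B) <= inv(A)": is_psd_2x2(minus(inverse_2x2(A), inverse_2x2(B))),
    "A^2 <= B^2": is_psd_2x2(minus(square_2x2(B), square_2x2(A))),
}
for name, holds in checks.items():
    print(name, holds)
```

For this pair, $B^2 - A^2$ has rows $(5, 1)$ and $(1, 0)$, whose determinant is $-1$, so it is not positive semidefinite even though $B - A$ is. The same happens for any symmetric $A$ with a nonzero off-diagonal entry when $B = A + \mathrm{diag}(c, 0)$ with $c > 0$.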

If anyone has insight into either of these, I’d be very interested!
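For anyone who wants to experiment with the second fact, here is a small brute-force sketch in plain Python. It assumes $d = 2$ (so that the singular values of $W_S$ come from the eigenvalues of a $2 \times 2$ Gram matrix in closed form), a random Gaussian $W$ with $n = 5$ rows, and a relative threshold for treating tiny eigenvalues as zero; all of these, and the function names, are choices made for illustration. It checks the submodularity inequality over every pair of subsets, which is evidence on one instance rather than a proof.

```python
import itertools
import math
import random

def spectral_power_sum(rows, p):
    """Sum of p-th powers of the nonzero singular values of a matrix with 2 columns.

    The squared singular values are the eigenvalues of the 2x2 Gram matrix
    G = W_S^T W_S = [[a, b], [b, d]].
    """
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    half_tr = (a + d) / 2.0
    disc = math.sqrt(max(half_tr * half_tr - (a * d - b * b), 0.0))
    eigs = (half_tr + disc, half_tr - disc)
    # Treat eigenvalues that are zero up to rounding as exactly zero.
    return sum(e ** (p / 2.0) for e in eigs if e > 1e-12 * (a + d))

def max_violation(w, p):
    """Largest value of f(S | T) + f(S & T) - f(S) - f(T) over all subset pairs."""
    n = len(w)
    subsets = [frozenset(c) for r in range(n + 1)
               for c in itertools.combinations(range(n), r)]
    f = {s: spectral_power_sum([w[i] for i in sorted(s)], p) for s in subsets}
    return max(f[s | t] + f[s & t] - f[s] - f[t] for s in subsets for t in subsets)

rng = random.Random(0)
w = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(5)]

# Submodularity holds on this instance if every value is <= 0 up to rounding.
violations = {p: max_violation(w, p) for p in (0.0, 0.5, 1.0, 1.5, 2.0)}
for p, v in violations.items():
    print(p, v)

# Two identical rows with p = 3: f(S | T) = 2**1.5 exceeds f(S) + f(T) = 2.
dup_violation = max_violation([[1.0, 0.0], [1.0, 0.0]], 3.0)
print(dup_violation)
```

The last check is there only to show that the test is not vacuous: with two identical rows the inequality reduces to $2^{p/2} \leq 2$, which fails for $p > 2$, consistent with the range $p \in [0,2]$ stated above.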

Difficulty of Predicting the Maximum of Gaussians

2016-01-13T00:00:00-08:00

Suppose that we have a random variable $X \in \mathbb{R}^d$, such that $\mathbb{E}[XX^{\top}] = I_{d \times d}$. Now take $k$ independent Gaussian random variables $Z_1, \ldots, Z_k \sim \mathcal{N}(0, I_{d \times d})$, and let $J$ be the argmax (over $j \in \{1, \ldots, k\}$) of $Z_j^{\top}X$.

It seems that it should be very hard to predict $J$ well, in the following sense: for any function $q(j \mid x)$, the expectation $\mathbb{E}_{x}[q(J \mid x)]$ should with high probability be very close to $\frac{1}{k}$ (where the probability is taken over the randomness in $Z$). In fact, Alex Zhai and I think that the probability of the expectation exceeding $\frac{1}{k}$ should be at most $\exp(-C(\epsilon/k)^2d)$ for some constant $C$. (We can already show this to be true if we replace $(\epsilon/k)^2$ with $(\epsilon/k)^4$.) I will not sketch a proof here, but the idea is pretty cool: it basically uses Lipschitz concentration of Gaussian random variables.

I’m mainly posting this problem because I think it’s pretty interesting, in case anyone else is inspired to work on it. It is closely related to the covering number of exponential families under the KL divergence, where we are interested in coverings at relatively large radii ($\log(k) - \epsilon$ rather than $\epsilon$).
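To get a feel for the setup, here is a rough Monte Carlo sketch in plain Python. Everything specific in it is an assumption made for illustration rather than part of the problem statement: $X$ is taken to be standard Gaussian (which satisfies $\mathbb{E}[XX^{\top}] = I$), the predictor $q(\cdot \mid x)$ is a softmax over $k$ linear scores with weights `v` fixed independently of $Z$, and the dimensions and sample size are arbitrary. It estimates $\mathbb{E}_{x}[q(J \mid x)]$ for a single draw of $Z_1, \ldots, Z_k$.

```python
import math
import random

def predictive_score(d=40, k=4, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_x[q(J | x)] for one draw of Z_1, ..., Z_k."""
    rng = random.Random(seed)

    def gauss_vec(scale):
        return [rng.gauss(0.0, scale) for _ in range(d)]

    def dot(u, x):
        return sum(ui * xi for ui, xi in zip(u, x))

    # The predictor q is fixed first, independently of Z (hypothetical choice).
    v = [gauss_vec(1.0 / math.sqrt(d)) for _ in range(k)]
    # Then the Gaussians Z_j ~ N(0, I_d) are drawn.
    z = [gauss_vec(1.0) for _ in range(k)]

    total = 0.0
    for _ in range(n_samples):
        x = gauss_vec(1.0)                                   # X ~ N(0, I_d)
        j_star = max(range(k), key=lambda j: dot(z[j], x))   # J = argmax_j Z_j^T x
        logits = [dot(vj, x) for vj in v]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]        # stable softmax
        total += weights[j_star] / sum(weights)              # q(J | x)
    return total / n_samples

estimate = predictive_score()
print(estimate, "compared with 1/k =", 1.0 / 4)
```

Rerunning with different seeds, larger `d`, or other choices of `q` gives a rough sense of how far the estimate strays from $\frac{1}{k}$. This only probes one family of predictors, whereas the conjecture concerns every $q$.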

Maximal Maximum-Entropy Sets

2015-09-07T00:00:00-07:00

Consider a probability distribution ${p(y)}$ on a space ${\mathcal{Y}}$. Suppose we want to construct a set ${\mathcal{P}}$ of probability distributions on ${\mathcal{Y}}$ such that ${p(y)}$ is the maximum-entropy distribution over ${\mathcal{P}}$:

$\displaystyle H(p) = \max_{q \in \mathcal{P}} H(q), $

where ${H(p) = \mathbb{E}_{p}[-\log p(y)]}$ is the entropy. We call such a set a maximum-entropy set for ${p}$. Furthermore, we would like ${\mathcal{P}}$ to be as large as possible, subject to the constraint that ${\mathcal{P}}$ is convex.

Does such a maximal convex maximum-entropy set ${\mathcal{P}}$ exist? That is, is there some convex set ${\mathcal{P}}$ such that ${p}$ is the maximum-entropy distribution in ${\mathcal{P}}$, and for any ${\mathcal{Q}}$ satisfying the same property, ${\mathcal{Q} \subseteq \mathcal{P}}$? It turns out that the answer is yes, and there is even a simple characterization of ${\mathcal{P}}$:

Proposition 1 For any distribution ${p}$ on ${\mathcal{Y}}$, the set

$\displaystyle \mathcal{P} = \{q \mid \mathbb{E}_{q}[-\log p(y)] \leq H(p)\} $

is the maximal convex maximum-entropy set for ${p}$.

To see why this is, first note that, clearly, ${p \in \mathcal{P}}$, and for any ${q \in \mathcal{P}}$ we have

$\displaystyle \begin{array}{rcl} H(q) &=& \mathbb{E}_{q}[-\log q(y)] \\ &\leq& \mathbb{E}_{q}[-\log p(y)] \\ &\leq& H(p), \end{array} $

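so ${p}$ is indeed the maximum-entropy distribution in ${\mathcal{P}}$ (the first inequality is Gibbs’ inequality, i.e. non-negativity of the KL divergence between ${q}$ and ${p}$). Note also that ${\mathcal{P}}$ is convex, since the constraint defining it is linear in ${q}$. On the other hand, let ${\mathcal{Q}}$ be any other convex set whose maximum-entropy distribution is ${p}$. Then in particular, for any ${q \in \mathcal{Q}}$ and any ${\epsilon \in [0,1]}$, convexity implies ${(1-\epsilon)p + \epsilon q \in \mathcal{Q}}$, so we must have ${H((1-\epsilon)p + \epsilon q) \leq H(p)}$. Let us suppose for the sake of contradiction that ${q \not\in \mathcal{P}}$, so that ${\mathbb{E}_{q}[-\log p(y)] > H(p)}$. Then we have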

$\displaystyle \begin{array}{rcl} H((1-\epsilon)p + \epsilon q) &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log((1-\epsilon)p(y)+\epsilon q(y))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log(p(y) + \epsilon (q(y)-p(y)))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[-\log(p(y)) - \epsilon \frac{q(y)-p(y)}{p(y)} + \mathcal{O}(\epsilon^2)\right] \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon^2 \mathbb{E}_{q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) + \mathcal{O}(\epsilon^2). \end{array} $

Since ${\mathbb{E}_{q}[-\log p(y)] - H(p) > 0}$, for sufficiently small ${\epsilon}$ this will exceed ${H(p)}$, which is a contradiction. Therefore we must have ${q \in \mathcal{P}}$ for all ${q \in \mathcal{Q}}$, and hence ${\mathcal{Q} \subseteq \mathcal{P}}$, so that ${\mathcal{P}}$ is indeed the maximal convex maximum-entropy set for ${p}$.
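The two halves of this argument can be checked numerically on a small finite space. The sketch below is plain Python; the three-point distributions `p`, `q_in`, `q_out`, the value of `eps`, and the function names are all hypothetical choices for illustration. It verifies that a member of ${\mathcal{P}}$ has entropy at most ${H(p)}$, and that mixing ${p}$ with a distribution outside ${\mathcal{P}}$ raises the entropy by approximately the first-order term ${\epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p))}$.

```python
import math

def entropy(q):
    """H(q) = E_q[-log q(y)], in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """E_q[-log p(y)]; assumes p(y) > 0 wherever q(y) > 0."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def in_max_entropy_set(q, p):
    """Membership test for P = {q : E_q[-log p(y)] <= H(p)}."""
    return cross_entropy(q, p) <= entropy(p)

def mix(p, q, eps):
    return [(1 - eps) * pi + eps * qi for pi, qi in zip(p, q)]

p = [0.5, 0.3, 0.2]
q_in = [0.7, 0.2, 0.1]    # satisfies the constraint, so H(q_in) <= H(p)
q_out = [0.2, 0.3, 0.5]   # a permutation of p: same entropy, but outside P

eps = 0.01
gain = entropy(mix(p, q_out, eps)) - entropy(p)
first_order = eps * (cross_entropy(q_out, p) - entropy(p))

print(in_max_entropy_set(q_in, p), entropy(q_in) <= entropy(p))
print(in_max_entropy_set(q_out, p), gain, first_order)
```

Note that `q_out` has exactly the same entropy as `p`, yet any convex set containing both would also contain mixtures with strictly larger entropy, which is why it has to be excluded.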

Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems

2015-06-24T00:00:00-07:00

Introduction

There has been much recent discussion about AI risk, meaning specifically the potential pitfalls (both short-term and long-term) that AI with improved capabilities could create for society. Discussants include AI researchers such as Stuart Russell, Eric Horvitz, and Tom Dietterich, entrepreneurs such as Elon Musk and Bill Gates, and research institutes such as the Machine Intelligence Research Institute (MIRI) and Future of Humanity Institute (FHI); the director of the latter institute, Nick Bostrom, has even written a bestselling book on this topic. Finally, ten million dollars in funding has been earmarked towards research on ensuring that AI will be safe and beneficial. Given this, I think it would be useful for AI researchers to discuss the nature and extent of risks that might be posed by increasingly capable AI systems, both short-term and long-term. As a PhD student in machine learning and artificial intelligence, I will use this essay to describe my own views on AI risk, in the hopes of encouraging other researchers to detail their thoughts as well.

For the purposes of this essay, I will define “AI” to be technology that can carry out tasks with limited or no human guidance, “advanced AI” to be technology that performs substantially more complex and domain-general tasks than are possible today, and “highly capable AI” to be technology that can outperform humans in all or almost all domains. As the primary target audience of this essay is other researchers, I have used technical terms (e.g. weakly supervised learning, inverse reinforcement learning) whenever they were useful, though I have also tried to make the essay more generally accessible when possible.

Outline

I think it is important to distinguish between two questions. First, does artificial intelligence merit the same degree of engineering safety considerations as other technologies (such as bridges)? Second, does artificial intelligence merit additional precautions, beyond those that would be considered typical? I will argue that the answer is yes to the first, even in the short term, and that current engineering methodologies in the field of machine learning do not provide even a typical level of safety or robustness. Moreover, I will argue that the answer to the second question in the long term is likely also yes — namely, that there are important ways in which highly capable artificial intelligence could pose risks which are not addressed by typical engineering concerns.

The point of this essay is not to be alarmist; indeed, I think that AI is likely to be net-positive for humanity. Rather, the point of this essay is to encourage a discussion about the potential pitfalls posed by artificial intelligence, since I believe that research done now can mitigate many of these pitfalls. Without such a discussion, we are unlikely to understand which pitfalls are most important or likely, and thus unable to design effective research programs to prevent them.

A common objection to discussing risks posed by AI is that it seems somewhat early on to worry about such risks, and the discussion is likely to be more germane if we wait to have it until after the field of AI has advanced further. I think this objection is quite reasonable in the abstract; however, as I will argue below, I think we do have a reasonable understanding of at least some of the risks that AI might pose, that some of these will be realized even in the medium term, and that there are reasonable programs of research that can address these risks, which in many cases would also have the advantage of improving the usability of existing AI systems.

Ordinary Engineering

There are many issues related to AI safety that are just a matter of good engineering methodology. For instance, we would ideally like systems that are transparent, modular, robust, and work under well-understood assumptions. Unfortunately, machine learning as a field has not developed very good methodologies for obtaining any of these things, and so this is an important issue to remedy. In other words, I think we should put at least as much thought into building an AI as we do into building a bridge.

Just to be very clear, I do not think that machine learning researchers are bad engineers; looking at any of the open source tools such as Torch, Caffe, MLlib, and others makes it clear that many machine learning researchers are also good software engineers. Rather, I think that as a field our methodologies are not mature enough to address the specific engineering desiderata of statistical models (in contrast to the algorithms that create them). In particular, the statistical models obtained from machine learning algorithms tend to be:

Opaque: Many machine learning models consist of hundreds of thousands of parameters, making it difficult to understand how predictions are made. Typically, practitioners resort to error analysis, examining the covariates that most strongly influence each incorrect prediction. However, this is not a very sustainable long-term solution, as it requires substantial effort even for relatively narrow-domain systems.
Monolithic: In part due to their opacity, models act as a black box, with no modularity or encapsulation of behavior. Though machine learning systems are often split into pipelines of smaller models, the lack of encapsulation can make these pipelines even harder to manage than a single large model; indeed, since machine learning models are by design optimized for a particular input distribution (i.e. whatever distribution they are trained on), we end up in a situation where “Changing Anything Changes Everything” [1].
Fragile: As another consequence of being optimized for a particular training distribution, machine learning models can have arbitrarily poor performance when that distribution shifts. For instance, Daumé and Marcu [2] show that a named entity classifier with 92% accuracy on one dataset drops to 58% accuracy on a superficially similar dataset. Though such issues are partially addressed by work on transfer learning and domain adaptation [3], these areas are not very developed compared to supervised learning.
Poorly understood: Beyond their fragility, understanding when a machine learning model will work is difficult. We know that a model will work if it is tested on the same distribution it is trained on, and have some extensions beyond this case (e.g. based on robust optimization [4]), but we have very little in the way of practically relevant conditions under which a model trained in one situation will work well in another situation. Although they are related, this issue differs from the opacity issue above in that it relates to making predictions about the system’s future behavior (in particular, generalization to new situations), versus understanding the internal workings of the current system.
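
As a toy illustration of the fragility point above (this is not the experiment from [2]), the sketch below fits a one-dimensional threshold classifier on one distribution and evaluates it on a shifted one. The Gaussian data, the class means, and the midpoint threshold rule are all assumptions made for illustration.

```python
import random


def sample(n, mean0, mean1, rng):
    """Draw n labeled points per class from two unit-variance Gaussians."""
    data = [(rng.gauss(mean0, 1.0), 0) for _ in range(n)]
    data += [(rng.gauss(mean1, 1.0), 1) for _ in range(n)]
    return data


def fit_threshold(data):
    """'Train' by placing the decision threshold midway between the class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0


def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' agrees with the label."""
    correct = sum(1 for x, y in data if (x > threshold) == (y == 1))
    return correct / len(data)


rng = random.Random(0)
train = sample(2000, 0.0, 2.0, rng)
threshold = fit_threshold(train)

in_dist = sample(2000, 0.0, 2.0, rng)   # same distribution as training
shifted = sample(2000, 2.0, 4.0, rng)   # both classes moved by +2 (assumed shift)

acc_in = accuracy(threshold, in_dist)
acc_shifted = accuracy(threshold, shifted)
print(f"in-distribution accuracy: {acc_in:.2f}, shifted accuracy: {acc_shifted:.2f}")
```

The two classes are exactly as separable after the shift as before it; only the learned threshold is stale, which is why accuracy falls sharply even though the new task is superficially the same.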

That these issues plague machine learning systems is likely uncontroversial among machine learning researchers. However, in comparison to research focused on extending capabilities, very little is being done to address them. Research in this area therefore seems particularly impactful, especially given the desire to deploy machine learning systems in increasingly complex and safety-critical situations.

Extraordinary Engineering

Does AI merit additional safety precautions, beyond those that are considered standard engineering practice in other fields? Here I am focusing only on the long-term impacts of advanced or highly capable AI systems.

My tentative answer is yes; there seem to be a few different ways in which AI could have bad effects, each of which seems individually unlikely but not implausible. Even if each of the risks identified so far is not likely, (i) the total risk might be large, especially if there are additional unidentified risks, and (ii) the existence of multiple “near-misses” motivates closer investigation, as it may suggest some underlying principle that makes AI risk-laden. In the sequel I will focus on so-called “global catastrophic” risks, meaning risks that could affect a large fraction of the earth’s population in a material way. I have chosen to focus on these risks because I think there is an important difference between an AI system messing up in a way that harms a few people (which would be a legal liability but perhaps should not motivate a major effort in terms of precautions) and an AI system that could cause damage on a global scale. The latter would justify substantial precautions, and I want to make it clear that this is the bar I am setting for myself.

With that in place, below are a few ways in which advanced or highly capable AI could have specific global catastrophic risks.

Cyber-attacks. There are two trends which taken together make the prospect of AI-aided cyber-attacks seem worrisome. The first trend is simply the increasing prevalence of cyber-attacks; even this year we have seen Russia attack Ukraine, North Korea attack Sony, and China attack the U.S. Office of Personnel Management. Secondly, the “Internet of Things” means that an increasing number of physical devices will be connected to the internet. Assuming that software exists to autonomously control them, many internet-enabled devices such as cars could be hacked and then weaponized, leading to a decisive military advantage in a short span of time. Such an attack could be enacted by a small group of humans aided by AI technologies, which would make it hard to detect in advance. Unlike other weaponizable technology such as nuclear fission or synthetic biology, it would be very difficult to control the distribution of AI since it does not rely on any specific raw materials. Finally, note that even a team with relatively small computing resources could potentially “bootstrap” to much more computing power by first creating a botnet with which to do computations; to date, the largest botnet has spanned 30 million computers and several other botnets have exceeded 1 million.

Autonomous weapons. Beyond cyber-attacks, improved autonomous robotics technology combined with ubiquitous access to miniature UAVs (“drones”) could allow both terrorists and governments to wage a particularly pernicious form of remote warfare by creating weapons that are both cheap and hard to detect or defend against (due to their small size and high maneuverability). Beyond direct malicious intent, if autonomous weapons systems or other powerful autonomous systems malfunction then they could cause a large amount of damage.

Mis-optimization. A highly capable AI could acquire a large amount of power but pursue an overly narrow goal, and end up harming humans or human value while optimizing for this goal. This may seem implausible at face value, but as I will argue below, it is easier to improve AI capabilities than to improve AI values, making such a mishap possible in theory.

Unemployment. It is already the case that increased automation is decreasing the number of available jobs, to the extent that some economists and policymakers are discussing what to do if the number of jobs is systematically smaller than the number of people seeking work. If AI systems allow a large number of jobs to be automated over a relatively short time period, then we may not have time to plan or implement policy solutions, and there could then be a large unemployment spike. In addition to the direct effects on the people who are unemployed, such a spike could also have indirect consequences by decreasing social stability on a global scale.

Opaque systems. It is also already the case that increasingly many tasks are being delegated to autonomous systems, from trades in financial markets to aggregation of information feeds. The opacity of these systems has led to issues such as the 2010 Flash Crash and will likely lead to larger issues in the future. In the long term, as AI systems become increasingly complex, humans may lose the ability to meaningfully understand or intervene in such systems, which could lead to a loss of sovereignty if autonomous systems are employed in executive-level functions (e.g. government, economy).

Beyond these specific risks, it seems clear that, eventually, AI will be able to outperform humans in essentially every domain. At that point, it seems doubtful that humanity will continue to have direct causal influence over its future unless specific measures are put in place to ensure this. While I do not think this day will come soon, I think it is worth thinking now about how we might meaningfully control highly capable AI systems, and I also think that many of the risks posed above (as well as others that we haven’t thought of yet) will occur on a somewhat shorter time scale.

Let me end with some specific ways in which control of AI may be particularly difficult compared to other human-engineered systems:

AI may be “agent-like”, which means that the space of possible behaviors is much larger; our intuitions about how AI will act in pursuit of a given goal may not account for this and so AI behavior could be hard to predict.
Since an AI would presumably learn from experience, and will likely run at a much faster serial processing speed than humans, its capabilities may change rapidly, ruling out the usual process of trial-and-error.
AI will act in a much more open-ended domain. In contrast, our existing tools for specifying the necessary properties of a system only work well in narrow domains. For instance, for a bridge, safety relates to the ability to successfully accomplish a small number of tasks (e.g. not falling over). For these, it suffices to consider well-characterized engineering properties such as tensile strength. For AI, the number of tasks we would potentially want it to perform is large, and it is unclear how to obtain a small number of well-characterized properties that would ensure safety.
Existing machine learning frameworks make it very easy for AI to acquire knowledge, but hard to acquire values. For instance, while an AI’s model of reality is flexibly learned from data, its goal/utility function is hard-coded in almost all situations; an exception is some work on inverse reinforcement learning [5], but this is still a very nascent framework. Importantly, the asymmetry between knowledge (and hence capabilities) and values is fundamental, rather than simply a statement about existing technologies. This is because knowledge is something that is regularly informed by reality, whereas values are only weakly informed by reality: an AI which learns incorrect facts could notice that it makes wrong predictions, but the world might never “tell” an AI that it learned the “wrong values”. At a technical level, while many tasks in machine learning are fully supervised or at least semi-supervised, value acquisition is a weakly supervised task.

In summary: there are several concrete global catastrophic risks posed by highly capable AI, and there are also several reasons to believe that highly capable AI would be difficult to control. Together, these suggest to me that the control of highly capable AI systems is an important problem posing unique research challenges.

Long-term Goals, Near-term Research

Above I presented an argument for why AI, in the long term, may require substantial precautionary efforts. Beyond this, I also believe that there is important research that can be done right now to reduce long-term AI risks. In this section I will elaborate on some specific research projects, though my list is not meant to be exhaustive.

Value learning: In general, it seems important in the long term (and also in the short term) to design algorithms for learning values / goal systems / utility functions, rather than requiring them to be hand-coded. One framework for this is inverse reinforcement learning [5], though developing additional frameworks would also be useful.
Weakly supervised learning: As argued above, inferring values, in contrast to beliefs, is an at most weakly supervised problem, since humans themselves are often incorrect about what they value and so any attempt to provide fully annotated training data about values would likely contain systematic errors. It may be possible to infer values indirectly through observing human actions; however, since humans often act immorally and human values change over time, current human actions are not consistent with our ideal long-term values, and so learning from actions in a naive way could lead to problems. Therefore, a better fundamental understanding of weakly supervised learning — particularly regarding guaranteed recovery of indirectly observed parameters under well-understood assumptions — seems important.
Formal specification / verification: One way to make AI safer would be to formally specify desiderata for its behavior, and then prove that these desiderata are met. A major open challenge is to figure out how to meaningfully specify formal properties for an AI system. For instance, even if a speech transcription system did a near-perfect job of transcribing speech, it is unclear what sort of specification language one might use to state this property formally. Beyond this, though there is much existing work in formal verification, it is still extremely challenging to verify large systems.
Transparency: To the extent that the decision-making process of an AI is transparent, it should be relatively easy to ensure that its impact will be positive. To the extent that the decision-making process is opaque, it should be relatively difficult to do so. Unfortunately, transparency seems difficult to obtain, especially for AIs that reach decisions through complex series of serial computations. Therefore, better techniques for rendering AI reasoning transparent seem important.
Strategic assessment and planning: Better understanding of the likely impacts of AI will allow a better response. To this end, it seems valuable to map out and study specific concrete risks; for instance, better understanding ways in which machine learning could be used in cyber-attacks, or forecasting the likely effects of technology-driven unemployment, and determining useful policies around these effects. It would also be clearly useful to identify additional plausible risks beyond those of which we are currently aware. Finally, thought experiments surrounding different possible behaviors of advanced AI would help inform intuitions and point to specific technical problems. Some of these tasks are most effectively carried out by AI researchers, while others should be done in collaboration with economists, policy experts, security experts, etc.

The above constitute at least five concrete directions of research on which I think important progress can be made today, which would meaningfully improve the safety of advanced AI systems and which in many cases would likely have ancillary benefits in the short term, as well.

At a high level, while I have implicitly provided a program of research above, there are other proposed research programs as well. Perhaps the earliest proposed program is from MIRI [6], which has focused on AI alignment problems that arise even in simplified settings (e.g. with unlimited computing power or easy-to-specify goals) in hopes of later generalizing to more complex settings. The Future of Life Institute (FLI) has also published a research priorities document [7, 8] with a broader focus, including non-technical topics such as regulation of autonomous weapons and economic shifts induced by AI-based technologies. I do not necessarily endorse either document, but think that both represent a big step in the right direction. Ideally, MIRI, FLI, and others will all justify why they think their problems are worth working on and we can let the best arguments and counterarguments rise to the top. This is already happening to some extent [9, 10, 11] but I would like to see more of it, especially from academics with expertise in machine learning and AI [12, 13].

In addition, several specific arguments I have advanced are similar to those already advanced by others. The issue of AI-driven unemployment has been studied by Brynjolfsson and McAfee [14], and is also discussed in the FLI research document. The problem of AI pursuing narrow goals has been elaborated through Bostrom’s “paperclipping argument” [15] as well as the orthogonality thesis [16], which states that beliefs and values are independent of each other. While I disagree with the orthogonality thesis in its strongest form, the arguments presented above for the difficulty of value learning can in many cases reach similar conclusions.

Omohundro [17] has argued that advanced agents would pursue certain instrumentally convergent drives under almost any value system, which is one way in which agent-like systems differ from systems without agency. Good [18] was the first to argue that AI capabilities could improve rapidly. Yudkowsky has argued that it would be easy for an AI to acquire power given few initial resources [19], though his example assumes the creation of advanced biotechnology.

Christiano has argued for the value of transparent AI systems, and proposed the “advisor games” framework as a potential operationalization of transparency [20].

Conclusion

To ensure the safety of AI systems, additional research is needed, both to meet ordinary short-term engineering desiderata as well as to make the additional precautions specific to highly capable AI systems. In both cases, there are clear programs of research that can be undertaken today, which in many cases seem to be under-researched relative to their potential societal value. I therefore think that well-directed research towards improving the safety of AI systems is a worthwhile undertaking, with the additional benefit of motivating interesting new directions of research.

Acknowledgments

Thanks to Paul Christiano, Holden Karnofsky, Percy Liang, Luke Muehlhauser, Nick Beckstead, Nate Soares, and Howie Lempel for providing feedback on a draft of this essay.

References

[1] D. Sculley, et al. Machine Learning: The High-Interest Credit Card of Technical Debt. 2014.
[2] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, pages 101–126, 2006.
[3] Sinno J. Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[4] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
[5] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
[6] Nate Soares and Benja Fallenstein. Aligning Superintelligence with Human Interests: A Technical Research Agenda. 2014.
[7] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. 2015.
[8] Daniel Dewey, Stuart Russell, and Max Tegmark. A survey of research questions for robust and beneficial AI. 2015.
[9] Paul Christiano. The Steering Problem. 2015.
[10] Paul Christiano. Stable self-improvement as an AI safety problem. 2015.
[11] Luke Muehlhauser. How to study superintelligence strategy. 2014.
[12] Stuart Russell. Of Myths and Moonshine. 2014.
[13] Tom Dietterich and Eric Horvitz. Benefits and Risks of Artificial Intelligence. 2015.
[14] Erik Brynjolfsson and Andrew McAfee. The second machine age: work, progress, and prosperity in a time of brilliant technologies. WW Norton & Company, 2014.
[15] Nick Bostrom (2003). Ethical Issues in Advanced Artificial Intelligence. Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence.
[16] Nick Bostrom. “The superintelligent will: Motivation and instrumental rationality in advanced artificial agents.” Minds and Machines 22.2 (2012): 71-85.
[17] Stephen M. Omohundro (2008). The Basic AI Drives. Frontiers in Artificial Intelligence and Applications (IOS Press).
[18] Irving J. Good. “Speculations concerning the first ultraintelligent machine.” Advances in computers 6.99 (1965): 31-83.
[19] Eliezer Yudkowsky. “Artificial intelligence as a positive and negative factor in global risk.” Global catastrophic risks 1 (2008): 303.
[20] Paul Christiano. Advisor Games. 2015.

A Fervent Defense of Frequentist Statistics

2014-02-10T00:00:00-08:00

[Highlights for the busy: de-bunking standard “Bayes is optimal” arguments; frequentist Solomonoff induction; and a description of the online learning framework.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5” below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6”.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven as so much other writing goes in the opposite direction of unabashedly defending Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticians would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

If you are not like me, then you may have had the experience of bringing up one of many reasonable objections to normative Bayesian epistemology, and having it shot down by one of many “standard” arguments that seem wrong but not for easy-to-articulate reasons. I hope to lend some reprieve to those of you in this camp, by providing a collection of “standard” replies to these standard arguments.

I will start with the myths (and responses) that I think will require the least technical background and be most interesting to a general audience. Toward the end, I deal with some attacks on frequentist methods that I believe amount to technical claims that are demonstrably false; doing so involves more math. Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

The Myths:

Bayesian methods are optimal.
Bayesian methods are optimal except for computational considerations.
We can deal with computational constraints simply by making approximations to Bayes.
The prior isn’t a big deal because Bayesians can always share likelihood ratios.
Frequentist methods need to assume their model is correct, or that the data are i.i.d.
Frequentist methods can only deal with simple models, and make arbitrary cutoffs in model complexity (aka: “I’m Bayesian because I want to do Solomonoff induction”).
Frequentist methods hide their assumptions while Bayesian methods make assumptions explicit.
Frequentist methods are fragile, Bayesian methods are robust.
Frequentist methods are responsible for bad science.
Frequentist methods are unprincipled/hacky.
Frequentist methods have no promising approach to computationally bounded inference.

Myth 1: Bayesian methods are optimal. Presumably when most people say this they are thinking of either Dutch-booking or the complete class theorem. Roughly, what these say is the following:

Dutch-book argument: Every coherent set of beliefs can be modeled as a subjective probability distribution. (Roughly, coherent means “unable to be Dutch-booked”.)

Complete class theorem: Every non-Bayesian method is worse than some Bayesian method (in the sense of performing deterministically at least as poorly in every possible world).

Let’s unpack both of these. My high-level argument regarding Dutch books is that I would much rather spend my time trying to correspond with reality than trying to be internally consistent. More concretely, the Dutch-book argument says that if for every bet you force me to take one side or the other, then unless I’m Bayesian there’s a collection of bets that will cause me to lose money for sure. I don’t find this very compelling. This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!” This just doesn’t seem worth the quant’s time: it will slow down the code substantially, and the quant should instead be writing the next program to make thousands more dollars. This is basically what Dutch-booking arguments seem like to me.
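
For readers who have not seen a Dutch book, here is a minimal sketch of one. It assumes a hypothetical agent who will buy a \$1 ticket on any event at a price equal to its stated probability for that event; the event names and numbers are made up for illustration.

```python
def net_payoff(prices, outcome):
    """Agent buys one 1-dollar ticket on every listed event at its stated price.

    prices:  dict mapping event -> price the agent considers fair
             (its stated probability for that event).
    outcome: the event that actually occurs.
    Returns winnings minus the total amount paid for tickets.
    """
    cost = sum(prices.values())
    winnings = 1.0 if outcome in prices else 0.0
    return winnings - cost


# Incoherent beliefs: the two probabilities sum to 1.2.
incoherent = {"rain": 0.6, "no_rain": 0.6}
# Coherent beliefs: the probabilities sum to 1.
coherent = {"rain": 0.6, "no_rain": 0.4}

incoherent_payoffs = [net_payoff(incoherent, o) for o in ("rain", "no_rain")]
coherent_payoffs = [net_payoff(coherent, o) for o in ("rain", "no_rain")]

print("incoherent agent:", incoherent_payoffs)  # negative in every world
print("coherent agent:  ", coherent_payoffs)    # breaks even in every world
```

The incoherent agent pays more for the pair of tickets than the pair can ever return, so it loses whatever happens. The argument in the text is that ruling out this kind of sure loss is a much weaker property than making good bets in the first place.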

Moving on, the complete class theorem says that for any decision rule, I can do better by replacing it with some Bayesian decision rule. But this injunction is not useful in practice, because it doesn’t say anything about which decision rule I should replace it with. Of course, if you hand me a decision rule and give me infinite computational resources, then I can hand you back a Bayesian method that will perform better. But it still might not perform well. All the complete class theorem says is that every local optimum is Bayesian. To be a useful theory of epistemology, I need a prescription for how, in the first place, I am to arrive at a good decision rule, not just a locally optimal one. And this is something that frequentist methods do provide, to a far greater extent than Bayesian methods (for instance by using minimax decision rules such as the maximum-entropy example given later). Note also that many frequentist methods do correspond to a Bayesian method for some appropriately chosen prior. But the crucial point is that the frequentist told me how to pick a prior I would be happy with (also, many frequentist methods don’t correspond to a Bayesian method for any choice of prior; they nevertheless often perform quite well).

Myth 2: Bayesian methods are optimal except for computational considerations. We already covered this in the previous point under the complete class theorem, but to re-iterate: Bayesian methods are locally optimal, not globally optimal. Identifying all the local optima is very different from knowing which of them is the global optimum. I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Myth 3: We can deal with computational constraints simply by making approximations to Bayes. I have rarely seen this borne out in practice. Here’s a challenge: suppose I give you data generated in the following way. There is a collection of vectors $x_1$, $x_2$, $\ldots$, $x_{10,000}$, each with $10^6$ coordinates. I generate outputs $y_1$, $y_2$, $\ldots$, $y_{10,000}$ in the following way. First I globally select $100$ of the $10^6$ coordinates uniformly at random, then I select a fixed vector $u$ such that those $100$ coordinates are drawn from i.i.d. Gaussians and the rest of the coordinates are zero. Now I set $y_n = u^{\top}x_n$ (i.e. $y_n$ is the dot product of $u$ with $x_n$). You are given $x$ and $y$, and your job is to infer $u$. This is a completely well-specified problem; the only task remaining is computational. I know people who have solved this problem using Bayesian methods with approximate inference. I have respect for these people, because doing so is no easy task. I think very few of them would say that “we can just approximate Bayesian updating and be fine”. (Also, this particular problem can be solved trivially with frequentist methods.)

A particularly egregious example of this is when people talk about “computable approximations to Solomonoff induction” or “computable approximations to AIXI” as if such notions were meaningful.
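
To make the challenge problem from Myth 3 concrete, here is a sketch of the data-generating process at a much smaller scale, reading each output $y_n$ as the dot product of $u$ with the vector $x_n$. The text does not say how the vectors $x_n$ themselves are chosen, so drawing them as i.i.d. standard Gaussians is an assumption, as are the function name and the reduced sizes. The sketch only sets up the problem; it does not attempt the inference.

```python
import random


def make_problem(num_examples, dim, num_active, rng):
    """Generate (xs, ys, u) for the sparse linear recovery challenge."""
    # Globally select the active coordinates uniformly at random.
    active = rng.sample(range(dim), num_active)

    # u is Gaussian on the active coordinates and zero everywhere else.
    u = [0.0] * dim
    for j in active:
        u[j] = rng.gauss(0.0, 1.0)

    # Assumption: inputs are i.i.d. standard Gaussian (not specified in the text).
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_examples)]

    # Each output is the dot product of u with the corresponding input;
    # only the active coordinates contribute.
    ys = [sum(u[j] * x[j] for j in active) for x in xs]
    return xs, ys, u


rng = random.Random(0)
# The essay's sizes are 10,000 examples, 10^6 coordinates, 100 active ones;
# these are scaled down so the sketch runs instantly.
xs, ys, u = make_problem(num_examples=20, dim=1000, num_active=5, rng=rng)
print("nonzero coordinates of u:", [j for j, v in enumerate(u) if v != 0.0])
```

Note that there are far fewer examples than coordinates, so the problem is only solvable because $u$ is sparse; exploiting that structure efficiently is the computational difficulty the challenge points at.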

Myth 4: the prior isn’t a big deal because Bayesians can always share likelihood ratios. Putting aside the practical issue that there would in general be an infinite number of likelihood ratios to share, there is the larger issue that for any hypothesis $h$, there is also the hypothesis $h'$ that matches $h$ exactly up to now, and then predicts the opposite of $h$ at all points in the future. You have to constrain model complexity at some point; the question is how. To put this another way, sharing my likelihood ratios without also constraining model complexity (by focusing on a subset of all logically possible hypotheses) would be equivalent to just sharing all sensory data I’ve ever accrued in my life. To the extent that such a notion is even possible, I certainly don’t need to be a Bayesian to do such a thing.
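
The point about a hypothesis and its "flipped" twin can be checked in a few lines. In this sketch, a hypothesis is a hypothetical function giving the probability that the bit observed at time $t$ equals 1; the particular hypothesis and the observed bits are made up for illustration.

```python
def h(t):
    """Hypothetical hypothesis: probability that the bit at time t equals 1."""
    return 0.9 if t % 2 == 0 else 0.1


def make_h_prime(hyp, now):
    """Build the twin that agrees with hyp before `now` and then predicts the opposite."""
    def h_prime(t):
        return hyp(t) if t < now else 1.0 - hyp(t)
    return h_prime


def likelihood(hyp, bits):
    """Probability the hypothesis assigns to the observed bit sequence."""
    p = 1.0
    for t, b in enumerate(bits):
        p *= hyp(t) if b == 1 else 1.0 - hyp(t)
    return p


observed = [1, 0, 1, 0, 1, 1, 1, 0]   # made-up data seen so far
now = len(observed)
h_prime = make_h_prime(h, now)

ratio = likelihood(h, observed) / likelihood(h_prime, observed)
print("likelihood ratio on past data:", ratio)
print("next-step predictions:", h(now), h_prime(now))
```

The two hypotheses assign identical probability to everything seen so far, so the likelihood ratio carries no information for choosing between them, yet they disagree about every future observation. Whatever separates them has to come from the prior or from some other constraint on model complexity.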

Myth 5: frequentist methods need to assume their model is correct or that the data are i.i.d. Understanding the content of this section is the most important single insight to gain from this essay. For some reason it’s assumed that frequentist methods need to make strong assumptions (such as Gaussianity), whereas Bayesian methods are somehow immune to this. In reality, the opposite is true. While there are many beautiful and deep frequentist formalisms that answer this, I will choose to focus on one of my favorites, which is online learning.

To explain the online learning framework, let us suppose that our data are $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$. We don’t observe $y_t$ until after making a prediction $z_t$ of what $y_t$ will be, and then we receive a penalty $L(y_t, z_t)$ based on how incorrect we were. So we can think of this as receiving prediction problems one-by-one, and in particular we make no assumptions about the relationship between the different problems; they could be i.i.d., they could be positively correlated, they could be anti-correlated, they could even be adversarially chosen.

As a running example, suppose that I’m betting on horses and before each race there are $n$ other people who give me advice on which horse to bet on. I know nothing about horses, so based on this advice I’d like to devise a good betting strategy. In this case, $x_t$ would be the $n$ bets that each of the other people recommend, $z_t$ would be the horse that I actually bet on, and $y_t$ would be the horse that actually wins the race. Then, supposing that $y_t = z_t$ (i.e., the horse I bet on actually wins), $L(y_t, z_t)$ is the negative of the payoff from correctly betting on that horse. Otherwise, if the horse I bet on doesn’t win, $L(y_t, z_t)$ is the cost I had to pay to place the bet.

If I’m in this setting, what guarantee can I hope for? I might ask for an algorithm that is guaranteed to make good bets — but this seems impossible unless the people advising me actually know something about horses. Or, at the very least, one of the people advising me knows something. Motivated by this, I define my regret to be the difference between my penalty and the penalty of the best of the $n$ people (note that I only have access to the latter after all $T$ rounds of betting). More formally, given a class $\mathcal{M}$ of predictors $h : x \mapsto z$, I define

$\mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^T L(y_t, z_t) - \min_{h \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^T L(y_t, h(x_t))$

In this case, $\mathcal{M}$ would have size $n$ and the $i$th predictor would just always follow the advice of person $i$. The regret is then how much worse I do on average than the best expert. A remarkable fact is that, in this case, there is a strategy such that $\mathrm{Regret}(T)$ shrinks at a rate of $\sqrt{\frac{\log(n)}{T}}$. In other words, I can have an average score within $\epsilon$ of the best advisor after $\frac{\log(n)}{\epsilon^2}$ rounds of betting.

One reason that this is remarkable is that it does not depend at all on how the data are distributed; the data could be i.i.d., positively correlated, negatively correlated, even adversarial, and one can still construct an (adaptive) prediction rule that does almost as well as the best predictor in the family.
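As a rough illustration, here is a minimal sketch of one standard strategy with this kind of guarantee, the exponential weights ("Hedge") algorithm. The text above does not say which strategy it has in mind, so the choice of Hedge, the function names, and the assumption that losses are scaled to lie in $[0, 1]$ are all mine. The prediction in each round is a probability distribution over advisors, and my loss is the corresponding weighted average of their losses.

```python
import math
import random


def hedge_average_regret(loss_rounds):
    """Run Hedge on a sequence of per-expert losses in [0, 1].

    loss_rounds[t][i] is the loss of expert i in round t. Returns the
    average regret of the weighted mixture against the best single
    expert in hindsight.
    """
    T = len(loss_rounds)
    n = len(loss_rounds[0])
    eta = math.sqrt(8.0 * math.log(n) / T)  # standard fixed learning rate
    cumulative = [0.0] * n                  # each expert's total loss so far
    my_total = 0.0
    for losses in loss_rounds:
        # Weights are exp(-eta * cumulative loss); shift by the min for stability.
        base = min(cumulative)
        weights = [math.exp(-eta * (c - base)) for c in cumulative]
        norm = sum(weights)
        # Loss of the mixture: the weighted average of the experts' losses.
        my_total += sum(w * l for w, l in zip(weights, losses)) / norm
        for i, l in enumerate(losses):
            cumulative[i] += l
    return (my_total - min(cumulative)) / T


def demo(n=20, T=2000, seed=0):
    rng = random.Random(seed)
    # Arbitrary, non-i.i.d. losses: expert i alternates between good and bad stretches.
    rounds = [[float((t // (i + 1)) % 2) if rng.random() < 0.9 else rng.random()
               for i in range(n)] for t in range(T)]
    # Second value: the textbook bound sqrt(log(n) / (2T)) for this learning rate.
    return hedge_average_regret(rounds), math.sqrt(math.log(n) / (2.0 * T))
```

Calling `demo()` returns the observed average regret together with the worst-case bound; the bound holds for any loss sequence in $[0, 1]$, not just the one generated here.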

To be even more concrete, if we assume that all costs and payoffs are bounded by \$1 per round, and that there are $1,000,000,000$ people in total, then an explicit upper bound is that after $28/\epsilon^2$ rounds, we will be within $\epsilon$ dollars on average of the best other person. Under slightly stronger assumptions, we can do even better; for instance, if the best person has an average variance of $0.1$ about their mean, then the $28$ can be replaced with $4.5$.

It is important to note that the betting scenario is just a running example, and one can still obtain regret bounds under fairly general scenarios; $\mathcal{M}$ could be continuous and $L$ could have quite general structure; the only technical assumption is that $\mathcal{M}$ be a convex set and that $L$ be a convex function of $z$. These assumptions tend to be easy to satisfy, though I have run into a few situations where they end up being problematic, mainly for computational reasons. For an $n$-dimensional model family, typically $\mathrm{Regret}(T)$ decreases at a rate of $\sqrt{\frac{n}{T}}$, although under additional assumptions this can be reduced to $\sqrt{\frac{\log(n)}{T}}$, as in the betting example above. I would consider this reduction to be one of the crowning results of modern frequentist statistics.

Yes, these guarantees sound incredibly awesome and perhaps too good to be true. They actually are that awesome, and they are actually true. The work is being done by measuring the error relative to the best model in the model family. We aren’t required to do well in an absolute sense; we just need to not do any worse than the best model. Of course, as long as at least one of the models in our family makes good predictions, that means we will as well. This is really what statistics is meant to be doing: you come up with everything you imagine could possibly be reasonable, and hand it to me, and then I come up with an algorithm that will figure out which of the things you handed me was most reasonable, and will do almost as well as that. As long as at least one of the things you come up with is good, then my algorithm will do well. Importantly, due to the $\log(n)$ dependence on the dimension of the model family, you can actually write down extremely broad classes of models and I will still successfully sift through them.

Let me stress again: regret bounds are saying that, no matter how the $x_t$ and $y_t$ are related, no i.i.d. assumptions anywhere in sight, we will do almost as well as any predictor $h$ in $\mathcal{M}$ (in particular, almost as well as the best predictor).

Myth 6: frequentist methods can only deal with simple models and need to make arbitrary cutoffs in model complexity. A naive perusal of the literature might lead one to believe that frequentists only ever consider very simple models, because many discussions center on linear and log-linear models. To dispel this, I will first note that there are just as many discussions that focus on much more general properties such as convexity and smoothness, and that can achieve comparably good bounds in many cases. But more importantly, the reason we focus so much on linear models is that we have already reduced a large family of problems to (log-)linear regression. The key insight, and I think one of the most important insights in all of applied mathematics, is that of featurization: given a non-linear problem, we can often embed it into a higher-dimensional linear problem, via a feature map $\phi : X \rightarrow \mathbb{R}^n$ ($\mathbb{R}^n$ denotes $n$-dimensional space, i.e. vectors of real numbers of length $n$). For instance, if I think that $y$ is a polynomial (say cubic) function of $x$, I can apply the mapping $\phi(x) = (1, x, x^2, x^3)$, and now look for a linear relationship between $y$ and $\phi(x)$.
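To make the cubic example concrete, here is a small sketch in plain Python. The data, the helper names, and the use of the normal equations are illustrative assumptions of mine; in practice one would use a numerical library. The point is only that ordinary linear least squares on $\phi(x)$ recovers a relationship that is non-linear in $x$.

```python
def phi(x):
    """Cubic feature map: x -> (1, x, x^2, x^3)."""
    return [1.0, x, x * x, x ** 3]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def fit_linear(features, ys):
    """Least squares via the normal equations (F^T F) w = F^T y."""
    d = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) for j in range(d)] for i in range(d)]
    rhs = [sum(f[i] * y for f, y in zip(features, ys)) for i in range(d)]
    return solve(gram, rhs)


xs = [i / 4.0 for i in range(-8, 9)]
ys = [2.0 - x + 0.5 * x ** 3 for x in xs]   # a cubic, hence non-linear in x
w = fit_linear([phi(x) for x in xs], ys)     # but linear in phi(x)
```

Here `w` comes out numerically close to the true coefficients `(2, -1, 0, 0.5)`.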

This insight extends far beyond polynomials. In combinatorial domains such as natural language, it is common to use indicator features: features that are $1$ if a certain event occurs and $0$ otherwise. For instance, I might have an indicator feature for whether two words appear consecutively in a sentence, whether two parts of speech are adjacent in a syntax tree, or for what part of speech a word has. Almost all state-of-the-art systems in natural language processing work by solving a relatively simple regression task (typically either log-linear or max-margin) over a rich feature space (often involving hundreds of thousands or millions of features, i.e. an embedding into $\mathbb{R}^{10^5}$ or $\mathbb{R}^{10^6}$).
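A toy sketch of indicator features, for the "two words appear consecutively" case. The feature naming scheme, the example sentence, and the weights are all made up for illustration; real systems use far larger feature sets and learn the weights from data.

```python
def bigram_indicators(words):
    """Map a sentence to sparse indicator features for consecutive word pairs."""
    return {("bigram", a, b): 1 for a, b in zip(words, words[1:])}


def score(weights, features):
    """Linear score: the sum of the weights of the features that fire."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


feats = bigram_indicators("the horse won the race".split())
# Hypothetical weights; any feature not listed has weight 0.
weights = {("bigram", "the", "horse"): 1.5, ("bigram", "won", "the"): -0.5}
total = score(weights, feats)
```

Storing only the features that fire is what makes an embedding into a space with millions of dimensions practical.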

A counter-argument to the previous point could be: “Sure, you could create a high-dimensional family of models, but it’s still a parameterized family. I don’t want to be stuck with a parameterized family, I want my family to include all Turing machines!” Putting aside for a second the question of whether “all Turing machines” is a well-advised model choice, this is something that a frequentist approach can handle just fine, using a tool called regularization, which after featurization is the second most important idea in statistics.

Specifically, given any sufficiently quickly growing function $\psi(h)$, one can show that, given $T$ data points, there is a strategy whose average error is at most $\sqrt{\frac{\psi(h)}{T}}$ worse than any estimator $h$. This can hold even if the model class $\mathcal{M}$ is infinite dimensional. For instance, if $\mathcal{M}$ consists of all probability distributions over Turing machines, and we let $h_i$ denote the probability mass placed on the $i$th Turing machine, then a valid regularizer $\psi$ would be

$\psi(h) = \sum_i h_i \log(i^2 \cdot h_i)$

If we consider this, then we see that, for any probability distribution over the first $2^k$ Turing machines (i.e. all Turing machines with description length $\leq k$), the value of $\psi$ is at most $\log((2^k)^2) = k\log(4)$. (Here we use the fact that $\psi(h) \leq \sum_i h_i \log(i^2)$, since $h_i \leq 1$ and hence $h_i\log(h_i) \leq 0$.) This means that, if we receive roughly $\frac{k}{\epsilon^2}$ data, we will achieve error within $\epsilon$ of the best Turing machine that has description length $\leq k$.
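The bound on $\psi$ is easy to check numerically. The sketch below assumes machines are indexed from $i = 1$ and uses the usual convention $0 \log 0 = 0$; the choice $k = 6$ and the helper names are arbitrary.

```python
import math
import random


def psi(h):
    """psi(h) = sum_i h_i * log(i^2 * h_i), with machines indexed from i = 1.

    h is a list of probabilities for machines 1..len(h); terms with
    h_i = 0 contribute 0 (the 0 * log 0 = 0 convention).
    """
    return sum(p * math.log(i * i * p) for i, p in enumerate(h, start=1) if p > 0)


def random_distribution(size, rng):
    """A random probability vector of the given length."""
    raw = [rng.random() for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


k = 6
size = 2 ** k
bound = k * math.log(4)
point_mass_on_last = [0.0] * (size - 1) + [1.0]   # all mass on machine 2^k
uniform = [1.0 / size] * size
```

The point mass on machine $2^k$ attains the bound $k\log(4)$ (up to floating-point error), while spreading the mass out only lowers $\psi$ because the entropy term is non-positive.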

Let me note several things here:

This strategy makes no assumptions about the data being i.i.d. It doesn’t even assume that the data are computable. It just guarantees that it will perform as well as any Turing machine (or distribution over Turing machines) given the appropriate amount of data.
This guarantee holds for any given sufficiently smooth measurement of prediction error (the update strategy depends on the particular error measure).
This guarantee holds deterministically, no randomness required (although predictions may need to consist of probability distributions rather than specific points, but this is also true of Bayesian predictions).

Interestingly, in the case that the prediction error is given by the negative log probability assigned to the truth, then the corresponding strategy that achieves the error bound is just normal Bayesian updating. But for other measurements of error, we get different update strategies. Although I haven’t worked out the math, intuitively this difference could be important if the universe is fundamentally unpredictable but our notion of error is insensitive to the unpredictable aspects.

Myth 7: frequentist methods hide their assumptions while Bayesian methods make assumptions explicit. I’m still not really sure where this came from. As we’ve seen numerous times so far, a very common flavor among frequentist methods is the following: I have a model class $\mathcal{M}$, I want to do as well as any model in $\mathcal{M}$; or put another way:

Assumption: At least one model in $\mathcal{M}$ has error at most $E$. Guarantee: My method will have error at most $E + \epsilon$.

This seems like a very explicit assumption with a very explicit guarantee. On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior. Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold). So given that that assumption doesn’t hold but Bayesian methods still often perform well in practice, I would say that Bayesian methods are making some other sort of “assumption” that is far less explicit (indeed, I would be very interested in understanding what this other, more nebulous assumption might be).

Myth 8: frequentist methods are fragile, Bayesian methods are robust. This is another one that’s straightforwardly false. First, since frequentist methods often rest on weaker assumptions they are more robust if the assumptions don’t quite hold. Secondly, there is an entire area of robust statistics, which focuses on being robust to adversarial errors in the problem data.

Myth 9: frequentist methods are responsible for bad science. I will concede that much bad science is done using frequentist statistics. But this is true only because pretty much all science is done using frequentist statistics. I’ve heard arguments that using Bayesian methods instead of frequentist methods would fix at least some of the problems with science. I don’t think this is particularly likely, as I think many of the problems come from mis-application of statistical tools or from failure to control for multiple hypotheses. If anything, Bayesian methods would exacerbate the former, because they often require more detailed modeling (although in most simple cases the difference doesn’t matter at all). I don’t think being Bayesian guards against multiple hypothesis testing. Yes, in some sense a prior “controls for multiple hypotheses”, but in general the issue is that the “multiple hypotheses” are never written down in the first place, or are written down and then discarded. One could argue that being in the habit of writing down a prior might make practitioners more likely to think about multiple hypotheses, but I’m not sure this is the first-order thing to worry about.

Myth 10: frequentist methods are unprincipled / hacky. One of the most beautiful theoretical paradigms that I can think of is what I would call the “geometric view of statistics”. One place that does a particularly good job of showcasing this is Shai Shalev-Shwartz’s PhD thesis, which was so beautiful that I cried when I read it. I’ll try (probably futilely) to convey a tiny amount of the intuition and beauty of this paradigm in the next few paragraphs, although focusing on minimax estimation, rather than online learning as in Shai’s thesis.

The geometric paradigm tends to emphasize a view of measurements (i.e. empirical expected values over observed data) as “noisy” linear constraints on a model family. We can control the noise by either taking few enough measurements that the total error from the noise is small (classical statistics), or by broadening the linear constraints to convex constraints (robust statistics), or by controlling the Lagrange multipliers on the constraints (regularization). One particularly beautiful result in this vein is the duality between maximum entropy and maximum likelihood. (I can already predict the Jaynesians trying to claim this result for their camp, but (i) Jaynes did not invent maximum entropy; (ii) maximum entropy is not particularly Bayesian (in the sense that frequentists use it as well); and (iii) the view on maximum entropy that I’m about to provide is different from the view given in Jaynes or by physicists in general [edit: EHeller thinks this last claim is questionable, see discussion here].)

To understand the duality mentioned above, suppose that we have a probability distribution $p(x)$ and the only information we have about it is the expected value of a certain number of functions, i.e. the information that $\mathbb{E}[\phi(x)] = \phi^*$, where the expectation is taken with respect to $p(x)$. We are interested in constructing a probability distribution $q(x)$ such that no matter what particular value $p(x)$ takes, $q(x)$ will still make good predictions. In other words (taking $\log q(x)$ as our measurement of prediction accuracy) we want $\mathbb{E}_{p'}[\log q(x)]$ to be large for all distributions $p'$ such that $\mathbb{E}_{p'}[\phi(x)] = \phi^*$. Using a technique called Lagrangian duality, we can both find the optimal distribution $q$ and compute its worst-case accuracy over all $p'$ with $\mathbb{E}_{p'}[\phi(x)] = \phi^*$. The characterization is as follows: consider all probability distributions $q(x)$ that are proportional to $\exp(\lambda^{\top}\phi(x))$ for some vector $\lambda$, i.e. $q(x) = \exp(\lambda^{\top}\phi(x))/Z(\lambda)$ for some $Z(\lambda)$. Of all of these, take the $q(x)$ with the largest value of $\lambda^{\top}\phi^* - \log Z(\lambda)$. Then $q(x)$ will be the optimal distribution and the accuracy for all distributions $p'$ will be exactly $\lambda^{\top}\phi^* - \log Z(\lambda)$. Furthermore, if $\phi^*$ is the empirical expectation given some number of samples, then one can show that $\lambda^{\top}\phi^* - \log Z(\lambda)$ is proportional to the log likelihood of $q$, which is why I say that maximum entropy and maximum likelihood are dual to each other.
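Here is a small numerical sketch of this characterization on a toy domain. Everything specific in it is my own assumption: outcomes are the faces of a die, the single feature is $\phi(x) = x$, the target is $\phi^* = 4.5$, and $\lambda$ is found by bisection (which works because the objective is concave in $\lambda$ with derivative $\phi^* - \mathbb{E}_q[\phi(x)]$).

```python
import math

XS = [1, 2, 3, 4, 5, 6]   # faces of a six-sided die (an assumed toy domain)


def log_Z(lam):
    """Log of the normalizer Z(lambda) for q(x) proportional to exp(lambda * x)."""
    return math.log(sum(math.exp(lam * x) for x in XS))


def q(lam):
    """The exponential-family distribution q(x) = exp(lambda * x) / Z(lambda)."""
    z = sum(math.exp(lam * x) for x in XS)
    return {x: math.exp(lam * x) / z for x in XS}


def fit_lambda(phi_star, lo=-20.0, hi=20.0, iters=200):
    """Maximize lam * phi_star - log Z(lam) by bisection on its derivative."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(x * p for x, p in q(mid).items())
        if mean < phi_star:
            lo = mid      # mean too small: lambda must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


phi_star = 4.5
lam = fit_lambda(phi_star)
dual_value = lam * phi_star - log_Z(lam)

# Any p' with mean 4.5 gets the same expected log-probability under q.
p_prime = {3: 0.5, 6: 0.5}
accuracy = sum(p * math.log(q(lam)[x]) for x, p in p_prime.items())
```

Because $\log q(x) = \lambda x - \log Z(\lambda)$ is linear in the feature, `accuracy` equals `dual_value` for every $p'$ satisfying the moment constraint, which is the "exactly" in the statement above.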

This is a relatively simple result but it underlies a decent chunk of models used in practice.

Myth 11: frequentist methods have no promising approach to computationally bounded inference. I would personally argue that frequentist methods are more promising than Bayesian methods at handling computational constraints, although computationally bounded inference is a very cutting edge area and I’m sure other experts would disagree. However, one point in favor of the frequentist approach here is that we already have some frameworks, such as the “tightening relaxations” framework discussed here, that provide quite elegant and rigorous ways of handling computationally intractable models.

References

(Myth 3) Sparse recovery: Sparse recovery using sparse matrices
(Myth 5) Online learning: Online learning and online convex optimization
(Myth 8) Robust statistics: see this blog post and the two linked papers
(Myth 10) Maximum entropy duality: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory

Another Critique of Effective Altruism

2014-01-05T00:00:00-08:00

I’ve decided to branch out a bit from technical discussions and engage in, as Scott Aaronson would call it, some metaphysical spouting. The topic of today is the effective altruism movement. I’m about to be relentlessly critical of it, so this is probably not the best post to read as your first introduction. Instead, read this and this. Then you can read what follows (but keep in mind that there are also many good things about the EA movement that I’m failing to mention here).

Recently Ben Kuhn wrote a critique of effective altruism. I’m glad to see such self-examination taking place, but I’m also concerned that the essay did not attack some of the most serious issues I see in the effective altruist movement, so I’ve decided to write my own critique. Due to time constraints, this critique is short and incomplete. I’ve tried to bring up arguments that would make people feel uncomfortable and defensive; hopefully I’ve succeeded.

Briefly, here are some of the major issues I have with the effective altruism movement as it currently stands:

Over-focus on “tried and true” and “default” options, which may both reduce actual impact and decrease exploration of new potentially high-value opportunities.
Over-confident claims coupled with insufficient background research.
Over-reliance on a small set of tools for assessing opportunities, which leads many to underestimate the value of things such as “flow-through” effects.

The common theme here is a subtle underlying message that simple, shallow analyses can allow one to make high-impact career and giving choices, and divest one of the need to dig further. I doubt that anyone explicitly believes this, but I do believe that this theme comes out implicitly both in arguments people make and in actions people take.

Lest this essay give a mistaken impression to the casual reader, I should note that there are many exemplary effective altruists who I feel are mostly immune to the issues above; for instance, the GiveWell blog does a very good job of warning against the first and third points above, and I would recommend anyone who isn’t already to subscribe to it (and there are other examples that I’m failing to mention). But for the purposes of this essay, I will ignore this fact except for the current caveat.

Over-focus on “tried and true” options

It seems to me that the effective altruist movement over-focuses on “tried and true” options, both in giving opportunities and in career paths. Perhaps the biggest example of this is the prevalence of “earning to give”. While this is certainly an admirable option, it should be considered as a baseline to improve upon, not a definitive answer.

The biggest issue with the “earning to give” path is that careers in finance and software (the two most common avenues for this) are incredibly straightforward and secure. The two things that finance and software have in common are that there is a well-defined application process similar to the one for undergraduate admissions, and that, given reasonable job performance, one will continue to be given promotions and raises (this probably entails working hard, but the end result is still rarely in doubt). One also gets a constant source of extrinsic positive reinforcement from the money they earn. Why do I call these things an “issue”? Because I think that these attributes encourage people to pursue these paths without looking for less obvious, less certain, but ultimately better paths. One in six Yale graduates go into finance and consulting, seemingly due to the simplicity of applying and the easy supply of extrinsic motivation. My intuition is that this ratio is higher than an optimal society would have, even if such people commonly gave generously (and it is certainly much higher than the number of people who enter college planning to pursue such paths).

Contrast this with, for instance, working at a start-up. Most start-ups are low-impact, but it is undeniable that at least some have been extraordinarily high-impact, so this seems like an area that effective altruists should be considering strongly. Why aren’t there more of us at 23&me, or Coursera, or Quora, or Stripe? I think it is because these opportunities are less obvious and take more work to find; once you start working, it often isn’t clear whether what you’re doing will have a positive impact or not; and your future job security is massively uncertain. There are few sources of extrinsic motivation in such a career: perhaps more so at one of the companies mentioned above, which are reasonably established and have customers, but what about the 4-person start-up teams working in a warehouse somewhere? Some of them will go on to do great things but right now their lives must be full of anxiousness and uncertainty.

I don’t mean to fetishize start-ups. They are just one well-known example of a potentially high-value career path that, to me, seems underexplored within the EA movement. I would argue (perhaps self-servingly) that academia is another example of such a path, with similar psychological obstacles: every 5 years or so you have the opportunity to get kicked out (e.g. applying for faculty jobs, and being up for tenure), you need to relocate regularly, few people will read your work and even fewer will praise it, and it won’t be clear whether it had a positive impact until many years down the road. And beyond the “obvious” alternatives of start-ups and academia, what of the paths that haven’t been created yet? GiveWell was revolutionary when it came about. Who will be the next GiveWell? And by this I don’t mean the next charity evaluator, but the next set of people who fundamentally alter how we view altruism.

Over-confident claims coupled with insufficient background research

The history of effective altruism is littered with over-confident claims, many of which have later turned out to be false. In 2009, Peter Singer claimed that you could save a life for \$200 (and many others repeated his claim). While the number was already questionable at the time, by 2011 we discovered that the number was completely off. Now new numbers were thrown around: from numbers still in the hundreds of dollars (GWWC’s estimate for SCI, which was later shown to be flawed) up to \$1600 (GiveWell’s estimate for AMF, which GiveWell itself expected to go up, and which indeed did go up). These numbers were often cited without caveats, as well as other claims such as that the effectiveness of charities can vary by a factor of 1,000. How many people citing these numbers understood the process that generated them, or the high degree of uncertainty surrounding them, or the inaccuracy of past estimates? How many would have pointed out that saying that charities vary by a factor of 1,000 in effectiveness is by itself not very helpful, and is more a statement about how bad the bottom end is than how good the top end is?

More problematic than the careless bandying of numbers is the tendency toward not doing strong background research. A common pattern I see is: an effective altruist makes a bold claim, then when pressed on it offers a heuristic justification together with the claim that “estimation is the best we have”. This sort of argument acts as a conversation-stopper (and can also be quite annoying, which may be part of what drives some people away from effective altruism). In many of these cases, there are relatively easy opportunities to do background reading to further educate oneself about the claim being made. It can appear to an outside observer as though people are opting for the fun, easy activity (speculation) rather than the harder and more worthwhile activity (research). Again, I’m not claiming that this is people’s explicit thought process, but it does seem to be what ends up happening.

Why haven’t more EAs signed up for a course on global security, or tried to understand how DARPA funds projects, or learned about third-world health? I’ve heard claims that this would be too time-consuming relative to the value it provides, but this seems like a poor excuse if we want to be taken seriously as a movement (or even just want to reach consistently accurate conclusions about the world).

Over-reliance on a small set of tools

Effective altruists tend to have a lot of interest in quantitative estimates. We want to know what the best thing to do is, and we want a numerical value. This causes us to rely on scientific studies, economic reports, and Fermi estimates. It can cause us to underweight things like the competence of a particular organization, the strength of the people involved, and other “intangibles” (which are often not actually intangible but simply difficult to assign a number to). It also can cause us to over-focus on money as a unit of altruism, while oftentimes “it isn’t about the money”: it’s about doing the groundwork that no one is doing, or finding the opportunity that no one has found yet.

Quantitative estimates often also tend to ignore flow-through effects: effects which are an indirect, rather than direct, result of an action (such as decreased disease in the third world contributing in the long run to increased global security). These effects are difficult to quantify but human and cultural intuition can do a reasonable job of taking them into account. As such, I often worry that effective altruists may actually be less effective than “normal” altruists. (One can point to all sorts of examples of farcical charities to claim that regular altruism sucks, but this misses the point that there are also amazing organizations out there, such as the Simons Foundation or HHMI, which are doing enormous amounts of good despite not subscribing to the EA philosophy.)

What’s particularly worrisome is that even if we were less effective than normal altruists, we would probably still end up looking better by our own standards, which explicitly fail to account for the ways in which normal altruists might outperform us (see above). This is a problem with any paradigm, but the fact that the effective altruist community is small and insular and relies heavily on its paradigm makes us far more susceptible to it.

Convex Conditions for Strong Convexity

2013-12-30T00:00:00-08:00

An important concept in online learning and convex optimization is that of strong convexity: a twice-differentiable function $f$ is said to be strongly convex with respect to a norm $\|\cdot\|$ if

$z^T\frac{\partial^2 f}{\partial x^2}z \geq \|z\|^2$

for all $z$ (for functions that are not twice-differentiable, there is an analogous criterion in terms of the Bregman divergence). To check strong convexity, then, we basically need to check a condition on the Hessian, namely that $z^THz \geq \|z\|^2$. So, under what conditions does this hold?

For the $l^2$ norm, the answer is easy: $z^THz \geq \|z\|_2^2$ if and only if $H \succeq I$ (i.e., $H-I$ is positive semidefinite). This can be shown in many ways; perhaps the easiest is to note that $z^THz-\|z\|_2^2 = z^T(H-I)z$.

For the $l^{\infty}$ norm, the answer is a bit trickier but still not too complicated. Recall that we want necessary and sufficient conditions under which $z^THz \geq \|z\|_{\infty}^2$. Note that this is equivalent to asking that $z^THz \geq (z_i)^2$ for each coordinate $i$ of $z$, which in turn is equivalent to $H \succeq e_ie_i^T$ for each coordinate vector $e_i$ (these are the vectors that are 1 in the $i$th coordinate and 0 everywhere else).

More generally, for any norm $\|\cdot\|$, there exists a dual norm $\|\cdot\|_*$ which satisfies, among other properties, the relationship $\|z\| = \sup_{\|w\|_* = 1} w^Tz$. So, in general, $z^THz \geq \|z\|^2$ is equivalent to asking that $z^THz \geq (w^Tz)^2$ for all $w$ with $\|w\|_* = 1$. But this is in turn equivalent to asking that

$H \succeq ww^T$ for all $w$ such that $\|w\|_* = 1$.

In fact, it suffices to pick a subset of the $w$ whose convex hull consists of all $w$ with $\|w\|_* \leq 1$; this is why we were able to obtain such a clean formulation in the $l^{\infty}$ case: the dual norm to $l^{\infty}$ is $l^1$, whose unit ball is the cross-polytope, which is a polytope with only $2n$ vertices (namely, each of the signed unit vectors $\pm e_i$).

We can also derive a simple (but computationally expensive) criterion for $l^1$ strong convexity: here the dual norm is $l^{\infty}$, whose unit ball is the $n$-dimensional hypercube, with vertices given by all $2^n$ vectors of the form $[ \pm 1 \ \cdots \ \pm 1]$. Thus $z^THz \geq \|z\|_1^2$ if and only if $H \succeq ss^T$ for all $2^n$ sign vectors $s$.
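The $l^{\infty}$ and $l^1$ criteria can be checked mechanically for small $n$. The sketch below is not from the argument above: it assumes $H$ is symmetric positive definite and uses the standard Schur-complement fact that, in that case, $H \succeq ww^T$ if and only if $w^TH^{-1}w \leq 1$. The function names are hypothetical, and the $l^1$ check enumerates all $2^n$ sign vectors, so it is only practical for small $n$.

```python
import itertools


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def dominates_rank_one(H, w, tol=1e-12):
    """Check H >= w w^T, assuming H is symmetric positive definite.

    Uses the equivalence H >= w w^T  <=>  w^T H^{-1} w <= 1.
    """
    x = solve(H, w)
    return sum(wi * xi for wi, xi in zip(w, x)) <= 1.0 + tol


def strongly_convex_l1(H):
    """H >= s s^T for all 2^n sign vectors s."""
    n = len(H)
    return all(dominates_rank_one(H, list(s))
               for s in itertools.product([1.0, -1.0], repeat=n))


def strongly_convex_linf(H):
    """H >= e_i e_i^T for each coordinate vector e_i."""
    n = len(H)
    basis = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return all(dominates_rank_one(H, e) for e in basis)


identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
scaled = [[3.0 * v for v in row] for row in identity]
```

For example, the $3 \times 3$ identity passes the $l^{\infty}$ check but fails the $l^1$ check, while $3I$ passes both, matching the inequality $\|z\|_1^2 \leq n\|z\|_2^2$.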

Finally, we re-examine the $l^2$ case; even though the $l^2$-ball is not a polytope, we were still able to obtain a very simple expression. This was because the condition $H \succeq I$ manages to capture simultaneously all dual vectors such that $w^Tw \leq 1$. We thus have the general criterion:

Theorem. $H \succeq M_jM_j^T$ for $j = 1,\ldots,m$ if and only if $H$ is strongly convex with respect to the norm $\|\cdot\|$ whose dual unit ball is the convex hull of the transformed unit balls $M_j\mathcal{B}_j$, $j = 1, \ldots, m$, where $\mathcal{B}_j$ is the $l^2$ unit ball whose dimension matches the number of columns of $M_j$.

Proof. $H \succeq M_jM_j^T$ for all $j$ if and only if $z^THz \geq \max_{j=1}^m \|M_j^Tz\|_2^2$ for all $z$. Now note that $\|M_j^Tz\|_2 = \sup_{w \in \mathcal{B}_j} w^TM_j^Tz = \sup_{w' \in M_j\mathcal{B}_j} (w')^Tz$. If we define $\|z\| = \max_{j=1}^m \|M_j^Tz\|_2$, it is then apparent that the dual norm unit ball is the convex hull of the $M_j\mathcal{B}_j$.

Convexity counterexample

2013-06-12T00:00:00-07:00

Here’s a fun counterexample: a function $\mathbb{R}^n \to \mathbb{R}$ that is jointly convex in any $n-1$ of the variables, but not in all variables at once. The function is

$f(x_1,\ldots,x_n) = \frac{1}{2}(n-1.5)\sum_{i=1}^n x_i^2 - \sum_{i < j} x_ix_j$

To see why this is, note that the Hessian of $f$ is equal to

$\left[ \begin{array}{cccc} n-1.5 & -1 & \cdots & -1 \\ -1 & n-1.5 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-1.5 \end{array} \right]$

This matrix is equal to $(n-0.5)I - J$, where $I$ is the identity matrix and $J$ is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is $n$. Therefore, this matrix has $n-1$ eigenvalues of $n-0.5$, as well as a single eigenvalue of $-0.5$, and hence is not positive semidefinite.

On the other hand, any submatrix of size $n-1$ is of the form $(n-0.5)I-J$, but where now $J$ is only $(n-1) \times (n-1)$. This matrix now has $n-2$ eigenvalues of $n-0.5$, together with a single eigenvalue of $0.5$, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any $n-1$ variables, and hence $f$ is convex in any $n-1$ variables, but not in all $n$ variables jointly.
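A quick sanity check of the counterexample, as a sketch with $n = 5$ chosen arbitrarily. It evaluates $f$ directly and multiplies by the Hessian $(n-0.5)I - J$ without forming the matrix.

```python
def f(x):
    """f(x) = (1/2)(n - 1.5) * sum_i x_i^2 - sum_{i<j} x_i x_j."""
    n = len(x)
    squares = sum(v * v for v in x)
    cross = sum(x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return 0.5 * (n - 1.5) * squares - cross


def hessian_times(v):
    """Multiply v by the Hessian (n - 0.5) I - J without forming the matrix."""
    n = len(v)
    total = sum(v)
    return [(n - 0.5) * vi - total for vi in v]


n = 5
ones = [1.0] * n
minus_ones = [-1.0] * n
zero = [0.0] * n
one_var_frozen = [1.0] * (n - 1) + [0.0]   # last variable held at 0

# The all-ones vector is an eigenvector with eigenvalue -0.5.
negative_direction = hessian_times(ones)

# Convexity fails along that direction: the midpoint of ones and minus_ones
# is zero, and f(zero) exceeds the average of the endpoint values.
violates_convexity = f(zero) > 0.5 * (f(ones) + f(minus_ones))
```

Along the all-ones direction $f$ takes the value $-n/4$, while with one variable frozen at zero the same direction gives $(n-1)/4 > 0$, consistent with the eigenvalues $-0.5$ and $0.5$ computed above.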

Probabilistic Abstractions I

2013-03-15T00:00:00-07:00

(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I’m trying to exposit on them in order to force myself to understand them better.)

For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of $K$ samples for the left and right subtrees, then look at all $K^2$ ways of combining them into the entire tree, but then downsample again to the $K$ trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).
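The beam-search step described above can be sketched in a few lines. The representation is an assumption of mine: each beam is a list of `(score, subtree)` pairs, a combined tree is just the pair of its subtrees, and the way scores combine is passed in as a function.

```python
import heapq
import itertools


def combine_beams(left, right, k, combine_score):
    """Combine two beams of (score, subtree) pairs and keep the top k.

    All len(left) * len(right) pairings are scored, then the beam is
    pruned back to the k highest-scoring trees.
    """
    candidates = (
        (combine_score(ls, rs), (lt, rt))
        for (ls, lt), (rs, rt) in itertools.product(left, right)
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


# Hypothetical subtree labels and scores, purely for illustration.
left = [(0.9, "NP1"), (0.5, "NP2")]
right = [(0.8, "VP1"), (0.1, "VP2")]
beam = combine_beams(left, right, k=2, combine_score=lambda a, b: a * b)
```

Note that both surviving trees here share the same right subtree, a small instance of the tendency of such methods to concentrate on one region of the search space.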

One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is difficulty of partial evaluation: if we have some training data that we are trying to fit to, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn’t nearly as large of a problem, since we can evaluate against a prefix of the time series to get a good idea (this perhaps explains the success of particle filters in these domains).

I’ve been working on a method that tries to deal with both of these problems, which I call probabilistic abstractions. The idea is to improve the diversity of particle-based methods by creating “fat” particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).

To be more concrete, if we have a probability distribution over a set of random variables $(X_1,\ldots,X_d)$, then our particles will be sets obtained by specifying the values of some of the $X_i$ and leaving the rest to vary arbitrarily. So, for instance, if $d=4$, then $\{(X_1,X_2,X_3,X_4) \mid X_2 = 1, X_4 = 7\}$ might be a possible “fat” particle.

By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are $S_1,\ldots,S_k$, and we assign probability $\pi_i$ to $S_i$, then we have the polytope of distributions $p$ that satisfy the constraints $p(S_1) = \pi_1, p(S_2) = \pi_2$, etc.

Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the maximum entropy distribution in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).

Suppose that we have a polytope for two independent distributions, and we want to compute the polytope for their product. This is easy: just look at the Cartesian products of each particle of the first distribution with each particle of the second distribution. If each individual distribution has $k$ particles, then the product distribution has $k^2$ particles. This could be problematic computationally, so we also want a way to narrow down to a subset of the $k$ most informative particles. These will be the $k$ particles whose corresponding polytope has the smallest maximum entropy. Finding this subset is NP-hard in general, but I’m currently working on good heuristics for computing it.
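As a rough illustration of the representation described above, here is a minimal sketch in which a fat particle is a dictionary from variable index to fixed value. The names `contains` and `combine` are hypothetical, the variables are 0-indexed and binary purely for convenience, and this is not meant to reflect any actual implementation of the method.

```python
from itertools import product

def contains(particle, state):
    """True if `state` (a tuple of values) lies in the fat particle.

    A particle is a dict {variable index: fixed value}; variables that
    are not listed are free to take any value.
    """
    return all(state[i] == v for i, v in particle.items())

def combine(p1, p2, offset):
    """Cartesian product of two particles over disjoint blocks of variables.

    The indices of p2 are shifted by `offset` (the number of variables in
    the first distribution) so that the two index sets do not collide.
    """
    merged = dict(p1)
    merged.update({i + offset: v for i, v in p2.items()})
    return merged

# Two particles over (X_0, X_1) and two over (X_2, X_3), all binary.
first = [{0: 1}, {0: 0, 1: 1}]
second = [{1: 0}, {}]  # {} is the particle that covers every state
products = [combine(a, b, offset=2) for a in first for b in second]

# Count how many of the 16 joint states each product particle covers.
states = list(product([0, 1], repeat=4))
sizes = [sum(contains(p, s) for s in states) for p in products]
print(len(products), sizes)
```

With $k=2$ particles per factor this produces $k^2=4$ product particles, which is the blow-up that the subsampling step is meant to control.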

Next, suppose that we have a distribution on a space $X$ and want to apply a function $f : X \to Y$ to it. If $f$ is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a valid abstraction of $f$: a function $\tilde{f} : 2^X \to 2^Y$ such that $\tilde{f}(S) \supseteq f(S)$ for all $S \in 2^X$. In this case, if we map a particle $S$ to $\tilde{f}(S)$, our equality constraint on the mass assigned to $S$ becomes a lower bound on the mass assigned to $\tilde{f}(S)$ — we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance — I feel like I have some understanding of this, but it’s something I’ll need to investigate empirically as well. It’s also interesting to note that $\tilde{f}$ (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.

Finally, suppose that I want to condition on data. In this case the polytope approach doesn’t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our “true” distribution, then condition. I haven’t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).

So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and can combine them, apply functions to them, subsample them, and condition on data. These are essentially all of the operations we want to be able to apply for particle-based methods, so we in theory should now be able to implement versions of these particle-based methods that get better coverage.

Pairwise Independence vs. Independence

2013-03-13T00:00:00-07:00

For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities — if $X_1,\ldots,X_n$ are independent, then their sum has a distribution that decays like $e^{-x^2}$. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev’s inequality, which says that their sum decays like $\frac{1}{x^2}$.

The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like $\frac{1}{x^2}$.

The construction is as follows: let $X_1,\ldots,X_d$ be independent, uniformly random binary variables, and for any $S \subset \{1,\ldots,d\}$, let $Y_S = \sum_{i \in S} X_i$, where the sum is taken mod 2. Then we can easily check that the $Y_S$ are pairwise independent. Now consider the random variable $Z = \sum_{S} Y_S$. If any of the $X_i$ is equal to 1, then we can pair up the $Y_S$ by either adding or removing $i$ from $S$ to get the other element of the pair. Exactly one element of each pair is equal to 1, so $Z = 2^{d-1}$ in this case. On the other hand, if all of the $X_i$ are equal to 0, then $Z = 0$ as well. Thus, with probability $\frac{1}{2^d}$, $Z$ deviates from its mean by $2^{d-1}-\frac{1}{2}$, whereas the variance of $Z$ is $2^{d-2}-\frac{1}{4}$. The bound on this probability from Chebyshev is $\frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2} = \frac{1}{2^d-1}$, which is very close to $\frac{1}{2^d}$, so this constitutes something very close to the Chebyshev equality case.
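For small $d$ the construction can be checked by brute force. The following sketch (with $d=4$ chosen arbitrarily, and the empty set omitted since its parity is always 0) enumerates all outcomes, verifies pairwise independence of the $Y_S$, and compares the Chebyshev bound with the actual tail probability using exact fractions.

```python
from fractions import Fraction
from itertools import product

d = 4
outcomes = list(product([0, 1], repeat=d))   # values of (X_1..X_d), each with probability 2^-d
subsets = [S for S in outcomes if any(S)]    # nonempty S, encoded as indicator vectors

def Y(S, x):
    """Parity of the X_i with i in S."""
    return sum(s * xi for s, xi in zip(S, x)) % 2

# Z as a function of the outcome x.
Z = {x: sum(Y(S, x) for S in subsets) for x in outcomes}

def pairwise_independent():
    """Check that (Y_S, Y_T) is uniform on {0,1}^2 for all distinct S, T."""
    quarter = len(outcomes) // 4
    for a, S in enumerate(subsets):
        for T in subsets[a + 1:]:
            counts = {}
            for x in outcomes:
                key = (Y(S, x), Y(T, x))
                counts[key] = counts.get(key, 0) + 1
            if sorted(counts.values()) != [quarter] * 4:
                return False
    return True

mean = Fraction(sum(Z.values()), len(outcomes))
variance = Fraction(sum(z * z for z in Z.values()), len(outcomes)) - mean ** 2
deviation = mean - Z[(0,) * d]               # deviation when every X_i is 0
chebyshev_bound = variance / deviation ** 2
actual = Fraction(sum(1 for z in Z.values() if abs(z - mean) >= deviation), len(outcomes))
print(pairwise_independent(), chebyshev_bound, actual)
```

For $d=4$ the bound comes out to $\frac{1}{15}$ against an actual probability of $\frac{1}{16}$.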

Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.

A Fun Optimization Problem

2013-02-09T00:00:00-08:00

I spent the last several hours trying to come up with an efficient algorithm for the following problem:

Problem: Suppose that we have a sequence of $l$ pairs of non-negative numbers $(a_1,b_1),\ldots,(a_l,b_l)$ such that $\sum_{i=1}^l a_i \leq A$ and $\sum_{i=1}^l b_i \leq B$. Devise an efficient algorithm to find the $k$ pairs $(a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k})$ that maximize

$\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).$

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in $O\left(\frac{l \log(l)}{\epsilon}\right)$ time and finds $2k$ pairs that do at least $1-\epsilon$ as well as the best set of $k$ pairs. It’s possible I need to assume something like $\sum_{i=1}^l a_i \leq A/2$ instead of just $A$ (and similarly for the $b_i$), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

Eigenvalue Bounds

2013-02-05T00:00:00-08:00

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric $n\times n$ matrices with eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ and $\mu_1 \geq \mu_2 \geq \ldots \geq \mu_n$ respectively, then $Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i$.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an $n\times n$ symmetric matrix and B is an $(n-k) \times (n-k)$ principal submatrix of A, then $\lambda_{i+k}(A) \leq \lambda_i(B) \leq \lambda_i(A)$, where $\lambda_i(X)$ is the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, $\sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X)$.

Proof: The left-hand-side is just the trace of the upper-left $k\times k$ principal submatrix of X, whose eigenvalues are by Theorem 2 bounded by the k largest eigenvalues of X. $\square$

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ are such that $\sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i$ for all k (with equality when $k=n$), and $c_1 \geq c_2 \geq \ldots \geq c_n$, then $\sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i$.

Proof: We have:

$\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i$

where the equalities come from the Abel summation method. $\square$

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that since the trace is invariant under similarity transforms, we can without loss of generality assume that A is diagonal, in which case we want to prove that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$. But by Corollary 1, we also know that $\sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i$ for all k, with equality when $k=n$ because both sides then equal the trace of B. Since by assumption the $\lambda_i$ are a decreasing sequence, Theorem 3 then implies that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$, which is what we wanted to show. $\square$
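A quick numerical sanity check of Theorem 1 is easy in the $2\times 2$ case, where the eigenvalues of a symmetric matrix have a closed form. The sketch below is only a spot check on random matrices (not a proof, and restricted to $n=2$ so that it needs nothing beyond the standard library); the helper names are made up for illustration.

```python
import math
import random

def eigenvalues_2x2(a, b, c):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    return mid + rad, mid - rad

def trace_product(A, B):
    """Trace(A^T B) for symmetric 2x2 matrices stored as triples (a, b, c)."""
    (a1, b1, c1), (a2, b2, c2) = A, B
    return a1 * a2 + 2 * b1 * b2 + c1 * c2

rng = random.Random(0)
worst_gap = float("inf")
for _ in range(10000):
    A = tuple(rng.uniform(-1, 1) for _ in range(3))
    B = tuple(rng.uniform(-1, 1) for _ in range(3))
    lam = eigenvalues_2x2(*A)
    mu = eigenvalues_2x2(*B)
    # Theorem 1 says this gap is never negative.
    gap = lam[0] * mu[0] + lam[1] * mu[1] - trace_product(A, B)
    worst_gap = min(worst_gap, gap)
print(worst_gap)
```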

Local KL Divergence

2013-02-02T00:00:00-08:00

The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions $p$ and $q$, the KL divergence is defined as

$KL(p \| q) := \int p(x) \log(p(x)/q(x)) dx$

Note that $KL(p \| q) \neq KL(q \| p)$. Intuitively, a small $KL(p \| q)$ means that there are few points that $p$ assigns high probability to but that $q$ does not. We can also think of $KL(p \| q)$ as the number of bits of information needed to update from the distribution $q$ to the distribution $p$.

Suppose that p and q are both mixtures of other distributions: $p(x) = \sum_i \alpha_i F_i(x)$ and $q(x) = \sum_i \beta_i G_i(x)$. Can we bound $KL(p \| q)$ in terms of the $KL(F_i \| G_i)$? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

Theorem: If $\sum_i \alpha_i = \sum_i \beta_i = 1$ and $F_i$ and $G_i$ are all probability distributions, then

$KL\left(\sum_i \alpha_i F_i \,\Big\|\, \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i \| G_i)\right)$.

Proof: If we expand the definition, then we are trying to prove that

$\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx$

We will in fact show that this is true for every value of $x$, so that it is certainly true for the integral. Using $\log(x/y) = -\log(y/x)$, re-write the condition for a given value of $x$ as

$\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)$

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen’s inequality on the $\log$ function:

$\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)$

This proves the inequality and therefore the theorem. $\square$

Remark: Intuitively, if we want to describe $\sum \alpha_i F_i$ in terms of $\sum \beta_i G_i$, it is enough to first locate the $i$th term in the sum and then to describe $F_i$ in terms of $G_i$. The theorem is a formalization of this intuition. In the case that $F_i = G_i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.
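The inequality is easy to spot-check numerically on finite sample spaces. The sketch below draws random mixture weights and random component distributions (the sizes 3 and 5 are arbitrary choices) and records the smallest slack between the two sides of the theorem; it is a sanity check rather than a proof.

```python
import math
import random

def random_dist(n, rng):
    """A random strictly positive probability vector of length n."""
    weights = [rng.random() + 0.01 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mixture(weights, components):
    """Pointwise mixture sum_i weights[i] * components[i]."""
    return [sum(w * c[x] for w, c in zip(weights, components))
            for x in range(len(components[0]))]

rng = random.Random(0)
smallest_slack = float("inf")
for _ in range(1000):
    alpha, beta = random_dist(3, rng), random_dist(3, rng)
    F = [random_dist(5, rng) for _ in range(3)]
    G = [random_dist(5, rng) for _ in range(3)]
    lhs = kl(mixture(alpha, F), mixture(beta, G))
    rhs = sum(a * (math.log(a / b) + kl(f, g))
              for a, b, f, g in zip(alpha, beta, F, G))
    smallest_slack = min(smallest_slack, rhs - lhs)
print(smallest_slack)
```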

Quadratically Independent Monomials

2013-01-31T00:00:00-08:00

Today Arun asked me the following question:

“Under what conditions will a set $\{p_1,\ldots,p_n\}$ of polynomials be quadratically independent, in the sense that $\{p_1^2, p_1p_2, p_2^2, p_1p_3,\ldots,p_{n-1}p_n, p_n^2\}$ is a linearly independent set?”

I wasn’t able to make much progress on this general question, but in the specific setting where the $p_i$ are all polynomials in one variable, and we further restrict to just monomials (i.e. $p_i(x) = x^{d_i}$ for some $d_i$), the condition is just that there are no distinct unordered pairs $(i_1,j_1),(i_2,j_2)$ such that $d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}$. Arun was interested in how large such a set could be for a given maximum degree $D$, so we are left with the following interesting combinatorics problem:

“What is the largest subset $S$ of $\{1,\ldots,D\}$ such that no two distinct pairs of elements of $S$ have the same sum?”

For convenience of notation let $n$ denote the size of $S$. A simple upper bound is $\binom{n+1}{2} \leq 2D-1$, since there are $\binom{n+1}{2}$ pairs to take a sum of, and all pairwise sums lie between $2$ and $2D$. We therefore have $n = O(\sqrt{D})$.

What about lower bounds on n? If we let S be the powers of 2 less than or equal to D, then we get a lower bound of $\log_2(D)$; we can do slightly better by taking the Fibonacci numbers instead, but this still only gives us logarithmic growth. So the question is, can we find sets that grow polynomially in D?

It turns out the answer is yes, and we can do so by choosing randomly. Let each element of $\{1,\ldots,D\}$ be placed in S with probability p. Now consider any k, $2 \leq k \leq 2D$. If k is odd, then there are (k-1)/2 possible pairs that could add up to k: (1,k-1), (2,k-2),…,((k-1)/2,(k+1)/2). The probability of each such pair existing is $p^2$. Note that each of these events is independent.

S is invalid if and only if there exists some k such that more than one of these pairs is active in S. The probability of any two given pairs being simultaneously active is $p^4$, and there are $\binom{(k-1)/2}{2} \leq \binom{D}{2}$ such pairs for a given $k$, hence $(D-1)\binom{D}{2} \leq D^3/2$ such pairs total (since we were just looking at odd k). Therefore, the probability of an odd value of k invalidating S is at most $p^4D^3/2$.

For even $k$ we get much the same result except that the probability for a given value of $k$ comes out to the slightly more complicated formula $\binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 \leq D^2p^4/2 + Dp^3 + p^2$, so that the total probability of an even value of k invalidating S is at most $p^4D^3/2 + p^3D^2 + p^2D$.

Putting this all together gives us a bound of $p^4D^3 + p^3D^2 + p^2D$. If we set p to be $\frac{1}{2}D^{-\frac{3}{4}}$ then the probability of S being invalid is at most $\frac{1}{16} + \frac{1}{8} D^{-\frac{1}{4}} + \frac{1}{4}D^{-\frac{1}{2}} \leq \frac{7}{16}$, so with probability at least $\frac{9}{16}$ a set S with elements chosen randomly with probability $\frac{1}{2}D^{-\frac{3}{4}}$ will be valid. On the other hand, such a set has $\frac{1}{2}D^{1/4}$ elements in expectation, and asymptotically the probability of having at least this many elements is $\frac{1}{2}$. Therefore, with probability at least $\frac{9}{16}+\frac{1}{2}-1 = \frac{1}{16}$ a randomly chosen set will be both valid and have size at least $\frac{1}{2}D^{1/4}$, which shows that the largest value of $n$ is at least $\Omega\left(D^{1/4}\right)$.

We can actually do better: if all elements are chosen with probability $\frac{1}{2}D^{-2/3}$, then one can show that the expected number of invalid pairs is at most $\frac{1}{8}D^{1/3} + O(1)$, and hence we can pick randomly with probability $p = \frac{1}{2}D^{-2/3}$, remove one element of each of the invalid pairs, and still be left with $\Omega(D^{1/3})$ elements in S.

So, to recap: choosing elements randomly gives us S of size $\Omega(D^{1/4})$; choosing randomly and then removing any offending pairs gives us S of size $\Omega(D^{1/3})$; and we have an upper bound of $O(D^{1/2})$. What is the actual asymptotic answer? I don’t actually know the answer to this, but I thought I’d share what I have so far because I think the techniques involved are pretty cool.
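The random-then-repair construction is simple to try out. The sketch below is one possible reading of it: it samples with $p = \frac{1}{2}D^{-2/3}$ and then repeatedly drops an element involved in some repeated sum until no collisions remain. The repair loop is a simplification chosen here for brevity (it is not claimed to be the optimal way to remove offending elements), and the function names are hypothetical.

```python
import random
from itertools import combinations_with_replacement

def colliding_sums(S):
    """Map each sum hit by more than one unordered pair (i <= j) to those pairs."""
    by_sum = {}
    for i, j in combinations_with_replacement(sorted(S), 2):
        by_sum.setdefault(i + j, []).append((i, j))
    return {s: pairs for s, pairs in by_sum.items() if len(pairs) > 1}

def random_then_repair(D, rng):
    """Sample each element of {1..D} with probability D^(-2/3)/2, then repair."""
    p = 0.5 * D ** (-2 / 3)
    S = {x for x in range(1, D + 1) if rng.random() < p}
    while True:
        bad = colliding_sums(S)
        if not bad:
            return S
        # Drop one element of one offending pair; S shrinks, so this terminates.
        pairs = next(iter(bad.values()))
        S.discard(pairs[0][0])

S = random_then_repair(10000, random.Random(0))
print(len(S), sorted(S))
```

The helper `colliding_sums` can also be used to confirm that the powers of 2 form a valid set, while something like $\{1,2,3\}$ does not (since $1+3 = 2+2$).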

Exponential Families

2012-12-21T00:00:00-08:00

In my last post I discussed log-linear models. In this post I’d like to take another perspective on log-linear models, by thinking of them as members of an exponential family. There are many reasons to take this perspective: exponential families give us efficient representations of log-linear models, which is important for continuous domains; they always have conjugate priors, which provide an analytically tractable regularization method; finally, they can be viewed as maximum-entropy models for a given set of sufficient statistics. Don’t worry if these terms are unfamiliar; I will explain all of them by the end of this post. Also note that most of this material is available on the Wikipedia page on exponential families, which I used quite liberally in preparing the below exposition.

1. Exponential Families

An exponential family is a family of probability distributions, parameterized by ${\theta \in \mathbb{R}^n}$, of the form

$p(x \mid \theta) \propto h(x)\exp(\theta^T\phi(x)).$ (1)

Notice the similarity to the definition of a log-linear model, which is

$p(x \mid \theta) \propto \exp(\theta^T\phi(x)).$ (2)

So, a log-linear model is simply an exponential family model with ${h(x) = 1}$. Note that we can re-write the right-hand-side of (1) as ${\exp(\theta^T\phi(x)+\log h(x))}$, so an exponential family is really just a log-linear model with one of the coordinates of $\theta$ constrained to equal ${1}$. Also note that the normalization constant in (1) is a function of $\theta$ (since $\theta$ fully specifies the distribution over ${x}$), so we can express (1) more explicitly as

$p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta)),$ (3)

where

$A(\theta) = \log\left(\int h(x)\exp(\theta^T\phi(x)) dx\right).$ (4)

Exponential families are capable of capturing almost all of the common distributions you are familiar with. There is an extensive table on Wikipedia; I’ve also included some of the most common below:

Gaussian distributions. Let ${\phi(x) = \left[ \begin{array}{c} x \\ x^2\end{array} \right]}$. Then ${p(x \mid \theta) \propto \exp(\theta_1x+\theta_2x^2)}$. If we let ${\theta = \left[\frac{\mu}{\sigma^2},-\frac{1}{2\sigma^2}\right]}$, then ${p(x \mid \theta) \propto \exp(\frac{\mu x}{\sigma^2}-\frac{x^2}{2\sigma^2}) \propto \exp(-\frac{1}{2\sigma^2}(x-\mu)^2)}$. We therefore see that Gaussian distributions are an exponential family for ${\phi(x) = \left[ \begin{array}{c} x \\ x^2 \end{array} \right]}$.
Poisson distributions. Let ${\phi(x) = [x]}$ and ${h(x) = \left\{\begin{array}{ccc} \frac{1}{x!} & : & x \in \{0,1,2,\ldots\} \\ 0 & : & \mathrm{else} \end{array}\right.}$. Then ${p(k \mid \theta) \propto \frac{1}{k!}\exp(\theta k)}$. If we let ${\theta = \log(\lambda)}$ then we get ${p(k \mid \theta) \propto \frac{\lambda^k}{k!}}$; we thus see that Poisson distributions are also an exponential family.
Multinomial distributions. Suppose that ${X = \{1,2,\ldots,n\}}$. Let ${\phi(k)}$ be an ${n}$-dimensional vector whose ${k}$th element is ${1}$ and where all other elements are zero. Then ${p(k \mid \theta) \propto \exp(\theta_k)}$, so that ${p(k \mid \theta) = \frac{\exp(\theta_k)}{\sum_{j=1}^n \exp(\theta_j)}}$. If ${\theta_k = \log P(x=k)}$, then we obtain an arbitrary multinomial distribution. Therefore, multinomial distributions are also an exponential family.
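To make the Poisson and multinomial examples concrete, the following sketch normalizes $h(x)\exp(\theta^T\phi(x))$ directly and compares the result against the familiar formulas. The truncation point `k_max` for the Poisson sum is an assumption made so that the normalizer can be computed by a finite sum.

```python
import math

def poisson_from_exp_family(theta, k_max=100):
    """Normalize h(k) * exp(theta * k) with h(k) = 1/k! over k = 0..k_max."""
    weights = [math.exp(theta * k - math.lgamma(k + 1)) for k in range(k_max + 1)]
    total = sum(weights)  # approximates exp(A(theta)); the tail beyond k_max is dropped
    return [w / total for w in weights]

def multinomial_from_exp_family(theta):
    """Normalize exp(theta_k) over k, i.e. a softmax of the parameters."""
    weights = [math.exp(t) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]

# Poisson: theta = log(rate) should reproduce rate^k e^{-rate} / k!.
rate = 3.0
probs = poisson_from_exp_family(math.log(rate))
direct = [rate ** k * math.exp(-rate) / math.factorial(k) for k in range(len(probs))]
poisson_error = max(abs(a - b) for a, b in zip(probs, direct))

# Multinomial: theta_k = log P(x = k) should reproduce P.
target = [0.2, 0.5, 0.3]
recovered = multinomial_from_exp_family([math.log(t) for t in target])
multinomial_error = max(abs(a - b) for a, b in zip(recovered, target))
print(poisson_error, multinomial_error)
```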

2. Sufficient Statistics

A statistic of a random variable ${X}$ is any deterministic function of that variable. For instance, if ${X = [X_1,\ldots,X_n]^T}$ is a vector of Gaussian random variables, then the sample mean ${\hat{\mu} := (X_1+\ldots+X_n)/n}$ and sample variance ${\hat{\sigma}^2 := (X_1^2+\cdots+X_n^2)/n-(X_1+\cdots+X_n)^2/n^2}$ are both statistics.

Let ${\mathcal{F}}$ be a family of distributions parameterized by $\theta$, and let ${X}$ be a random variable with distribution given by some unknown ${\theta_0}$. Then a vector ${T(X)}$ of statistics is called a sufficient statistic for ${\theta_0}$ if it contains all possible information about ${\theta_0}$; that is, for any function ${f}$, we have

$\mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] = S(f,T_0),$ (5)

for some function ${S}$ that has no dependence on ${\theta_0}$.

For instance, let ${X}$ be a vector of ${n}$ independent Gaussian random variables ${X_1,\ldots,X_n}$ with unknown mean ${\mu}$ and variance ${\sigma^2}$. It turns out that ${T(X) := [\hat{\mu},\hat{\sigma}^2]}$ is a sufficient statistic for ${\mu}$ and ${\sigma^2}$. This is not immediately obvious; a very useful tool for determining whether statistics are sufficient is the Fisher-Neyman factorization theorem:

Theorem 1 (Fisher-Neyman) Suppose that ${X}$ has a probability density function ${p(X \mid \theta)}$. Then the statistics ${T(X)}$ are sufficient for $\theta$ if and only if ${p(X \mid \theta)}$ can be written in the form

$p(X \mid \theta) = h(X)g_\theta(T(X)).$ (6)

In other words, the probability of ${X}$ can be factored into a part that does not depend on $\theta$, and a part that depends on $\theta$ only via ${T(X)}$.

What is going on here, intuitively? If ${p(X \mid \theta)}$ depended only on ${T(X)}$, then ${T(X)}$ would definitely be a sufficient statistic. But that isn’t the only way for ${T(X)}$ to be a sufficient statistic: ${p(X \mid \theta)}$ could also just not depend on $\theta$ at all, in which case ${T(X)}$ would trivially be a sufficient statistic (as would anything else). The Fisher-Neyman theorem essentially says that the only way in which ${T(X)}$ can be a sufficient statistic is if the density of ${X}$ is a product of these two cases.

Proof: If (6) holds, then we can check that (5) is satisfied:

$\begin{array}{rcl} \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] &=& \frac{\int_{T(X) = T_0} f(X) dp(X \mid \theta=\theta_0)}{\int_{T(X) = T_0} dp(X \mid \theta=\theta_0)}\\ \\ &=& \frac{\int_{T(X)=T_0} f(X)h(X)g_\theta(T_0) dX}{\int_{T(X)=T_0} h(X)g_\theta(T_0) dX}\\ \\ &=& \frac{\int_{T(X)=T_0} f(X)h(X)dX}{\int_{T(X)=T_0} h(X) dX}, \end{array} $

where the right-hand-side has no dependence on $\theta$.

On the other hand, if we compute ${\mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0]}$ for an arbitrary density ${p(X \mid \theta)}$, we get

$\begin{array}{rcl} \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] &=& \int_{T(X) = T_0} f(X) \frac{p(X \mid \theta=\theta_0)}{\int_{T(X)=T_0} p(X \mid \theta=\theta_0) dX} dX. \end{array} $

If the right-hand-side cannot depend on $\theta$ for any choice of ${f}$, then the term that we multiply ${f}$ by must not depend on $\theta$; that is, ${\frac{p(X \mid \theta=\theta_0)}{\int_{T(X) = T_0} p(X \mid \theta=\theta_0) dX}}$ must be some function ${h_0(X, T_0)}$ that depends only on ${X}$ and ${T_0}$ and not on $\theta$. On the other hand, the denominator ${\int_{T(X)=T_0} p(X \mid \theta=\theta_0) dX}$ depends only on ${\theta_0}$ and ${T_0}$; call this dependence ${g_{\theta_0}(T_0)}$. Finally, note that ${T_0}$ is a deterministic function of ${X}$, so let ${h(X) := h_0(X,T(X))}$. We then see that ${p(X \mid \theta=\theta_0) = h_0(X, T_0)g_{\theta_0}(T_0) = h(X)g_{\theta_0}(T(X))}$, which is the same form as (6), thus completing the proof of the theorem. $\Box$

Now, let us apply the Fisher-Neyman theorem to exponential families. By definition, the density for an exponential family factors as

$p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta)). $

If we let ${T(x) = \phi(x)}$ and ${g_\theta(\phi(x)) = \exp(\theta^T\phi(x)-A(\theta))}$, then the Fisher-Neyman condition is met; therefore, ${\phi(x)}$ is a vector of sufficient statistics for the exponential family. In fact, we can go further:

Theorem 2 Let ${X_1,\ldots,X_n}$ be drawn independently from an exponential family distribution with fixed parameter $\theta$. Then the empirical expectation ${\hat{\phi} := \frac{1}{n} \sum_{i=1}^n \phi(X_i)}$ is a sufficient statistic for $\theta$.

Proof: The density for ${X_1,\ldots,X_n}$ given $\theta$ is

$\begin{array}{rcl} p(X_1,\ldots,X_n \mid \theta) &=& h(X_1)\cdots h(X_n) \exp(\theta^T\sum_{i=1}^n \phi(X_i) - nA(\theta)) \\ &=& h(X_1)\cdots h(X_n)\exp(n [\theta^T\hat{\phi}-A(\theta)]). \end{array} $

Letting ${h(X_1,\ldots,X_n) = h(X_1)\cdots h(X_n)}$ and ${g_\theta(\hat{\phi}) = \exp(n[\theta^T\hat{\phi}-A(\theta)])}$, we see that the Fisher-Neyman conditions are satisfied, so that ${\hat{\phi}}$ is indeed a sufficient statistic. $\Box$

Finally, we note (without proof) the same relationship as in the log-linear case to the gradient and Hessian of ${\log p(X_1,\ldots,X_n \mid \theta)}$ with respect to the model parameters:

Theorem 3 Again let ${X_1,\ldots,X_n}$ be drawn from an exponential family distribution with parameter $\theta$. Then the gradient of ${\log p(X_1,\ldots,X_n \mid \theta)}$ with respect to $\theta$ is

$n \times \left(\hat{\phi}-\mathbb{E}[\phi \mid \theta]\right) $

and the Hessian is

$n \times \left(\mathbb{E}[\phi \mid \theta]\mathbb{E}[\phi \mid \theta]^T - \mathbb{E}[\phi\phi^T \mid \theta]\right). $

This theorem provides an efficient algorithm for fitting the parameters of an exponential family distribution (for details on the algorithm, see the part near the end of the log-linear models post on parameter estimation).
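Reading Theorem 3 as a statement about the log-likelihood, the formulas can be checked by finite differences in a one-parameter family. The sketch below uses the Poisson family, where $\phi(x) = x$ and $A(\theta) = e^\theta$, on a small made-up data set; the data values and the step size `eps` are arbitrary choices for illustration.

```python
import math

data = [2, 0, 3, 1, 4]          # hypothetical observations
n = len(data)
phi_hat = sum(data) / n         # empirical expectation of phi(x) = x

def log_likelihood(theta):
    """log p(X_1..X_n | theta) for the Poisson family with A(theta) = exp(theta)."""
    return sum(theta * x - math.lgamma(x + 1) for x in data) - n * math.exp(theta)

theta, eps = 0.3, 1e-4
numeric_grad = (log_likelihood(theta + eps) - log_likelihood(theta - eps)) / (2 * eps)
numeric_hess = (log_likelihood(theta + eps) - 2 * log_likelihood(theta)
                + log_likelihood(theta - eps)) / eps ** 2

mean = math.exp(theta)              # E[phi | theta] for a Poisson
second_moment = mean + mean ** 2    # E[phi^2 | theta] for a Poisson
formula_grad = n * (phi_hat - mean)
formula_hess = n * (mean * mean - second_moment)
print(numeric_grad, formula_grad, numeric_hess, formula_hess)
```

Since the Hessian is minus $n$ times a variance, the log-likelihood is concave, which is what makes gradient-based fitting well behaved.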

3. Moments of an Exponential Family

If ${X}$ is a real-valued random variable, then the ${p}$th moment of ${X}$ is ${\mathbb{E}[X^p]}$. In general, if ${X = [X_1,\ldots,X_n]^T}$ is a random variable on ${\mathbb{R}^n}$, then for every sequence ${p_1,\ldots,p_n}$ of non-negative integers, there is a corresponding moment ${M_{p_1,\cdots,p_n} := \mathbb{E}[X_1^{p_1}\cdots X_n^{p_n}]}$.

In exponential families there is a very nice relationship between the normalization constant ${A(\theta)}$ and the moments of ${X}$. Before we establish this relationship, let us define the moment generating function of a random variable ${X}$ as ${f(\lambda) = \mathbb{E}[\exp(\lambda^TX)]}$.

Lemma 4 The moment generating function for a random variable ${X}$ is equal to

$\sum_{p_1,\ldots,p_n=0}^{\infty} M_{p_1,\cdots,p_n} \frac{\lambda_1^{p_1}\cdots \lambda_n^{p_n}}{p_1!\cdots p_n!}. $

The proof of Lemma 4 is a straightforward application of Taylor’s theorem, together with linearity of expectation (note that in one dimension, the expression in Lemma 4 would just be ${\sum_{p=0}^{\infty} \mathbb{E}[X^p] \frac{\lambda^p}{p!}}$).

We now see why ${f(\lambda)}$ is called the moment generating function: it is the exponential generating function for the moments of ${X}$. The moment generating function for the sufficient statistics of an exponential family is particularly easy to compute:

Lemma 5 If ${p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta))}$, then ${\mathbb{E}[\exp(\lambda^T\phi(x))] = \exp(A(\theta+\lambda)-A(\theta))}$.

Proof:

$\begin{array}{rcl} \mathbb{E}[\exp(\lambda^T\phi(x))] &=& \int \exp(\lambda^T\phi(x)) p(x \mid \theta) dx \\ &=& \int \exp(\lambda^T\phi(x))h(x)\exp(\theta^T\phi(x)-A(\theta)) dx \\ &=& \int h(x)\exp((\theta+\lambda)^T\phi(x)-A(\theta)) dx \\ &=& \int h(x)\exp((\theta+\lambda)^T\phi(x)-A(\theta+\lambda))dx \times \exp(A(\theta+\lambda)-A(\theta)) \\ &=& \int p(x \mid \theta+\lambda) dx \times \exp(A(\theta+\lambda)-A(\theta)) \\ &=& \exp(A(\theta+\lambda)-A(\theta)), \end{array} $

where the last step uses the fact that ${p(x \mid \theta+\lambda)}$ is a probability density and hence ${\int p(x \mid \theta+\lambda) dx = 1}$. $\Box$

Now, by Lemma 4, ${M_{p_1,\cdots,p_n}}$ is (up to the factorial factors) just the ${(p_1,\ldots,p_n)}$ coefficient in the Taylor series for the moment generating function ${f(\lambda)}$, and hence we can compute ${M_{p_1,\cdots,p_n}}$ as ${\frac{\partial^{p_1+\cdots+p_n} f(\lambda)}{\partial^{p_1}\lambda_1\cdots \partial^{p_n}\lambda_n}}$ evaluated at ${\lambda = 0}$. Combining this with Lemma 5 gives us a closed-form expression for ${M_{p_1,\cdots,p_n}}$ in terms of the normalization constant ${A(\theta)}$:

Lemma 6 The moments of the sufficient statistics ${\phi(x)}$ of an exponential family can be computed as

$M_{p_1,\ldots,p_n} = \left.\frac{\partial^{p_1+\cdots+p_n} \exp(A(\theta+\lambda)-A(\theta))}{\partial^{p_1}\lambda_1\cdots \partial^{p_n}\lambda_n}\right|_{\lambda=0}. $

For those who prefer cumulants to moments, I will note that there is a version of Lemma 6 for cumulants with an even simpler formula.
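As an illustration of Lemma 6 (evaluating the derivatives at $\lambda = 0$), the sketch below uses the Poisson family, for which $\phi(x) = x$ and $A(\theta) = e^\theta$. It approximates the first two derivatives of $\exp(A(\theta+\lambda)-A(\theta))$ by central finite differences and compares them with moments computed by direct summation. The rate $2.5$, the step size, and the truncation at 200 terms are arbitrary choices.

```python
import math

def A(theta):
    """Log-normalizer of the Poisson family: log sum_k exp(theta * k) / k!."""
    return math.exp(theta)

def mgf(lam, theta):
    """Moment generating function of phi(x) = x, as given by Lemma 5."""
    return math.exp(A(theta + lam) - A(theta))

theta, eps = math.log(2.5), 1e-4
# Central finite differences for the first and second derivatives at lambda = 0.
first = (mgf(eps, theta) - mgf(-eps, theta)) / (2 * eps)
second = (mgf(eps, theta) - 2 * mgf(0.0, theta) + mgf(-eps, theta)) / eps ** 2

# Direct computation of E[X] and E[X^2] from the (truncated) Poisson pmf.
rate = math.exp(theta)
pmf = [math.exp(k * theta - rate - math.lgamma(k + 1)) for k in range(200)]
direct_first = sum(k * p for k, p in enumerate(pmf))
direct_second = sum(k * k * p for k, p in enumerate(pmf))
print(first, direct_first, second, direct_second)
```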

Exercise: Use Lemma 6 to compute ${\mathbb{E}[X^6]}$, where ${X}$ is a Gaussian with mean ${\mu}$ and variance ${\sigma^2}$.

4. Conjugate Priors

Given a family of distributions ${p(X \mid \theta)}$, a conjugate prior family ${p(\theta \mid \alpha)}$ is a family that has the property that

$p(\theta \mid X, \alpha) = p(\theta \mid \alpha') $

for some ${\alpha'}$ depending on ${\alpha}$ and ${X}$. In other words, if the prior over $\theta$ lies in the conjugate family, and we observe ${X}$, then the posterior over $\theta$ also lies in the conjugate family. This is very useful algebraically as it means that we can get our posterior simply by updating the parameters of the prior. The following are examples of conjugate families:

Gaussian-Gaussian

Let ${p(X \mid \mu) \propto \exp(-(X-\mu)^2/2)}$, and let ${p(\mu \mid \mu_0, \sigma_0) \propto \exp(-(\mu-\mu_0)^2/2\sigma_0^2)}$. Then, by Bayes’ rule,

$\begin{array}{rcl} p(\mu \mid X=x, \mu_0, \sigma_0) &\propto& \exp(-(x-\mu)^2/2)\exp(-(\mu-\mu_0)^2/2\sigma_0^2) \\ &=& \exp\left(-\frac{(\mu-\mu_0)^2+\sigma_0^2(\mu-x)^2}{2\sigma_0^2}\right) \\ &\propto& \exp\left(-\frac{(1+\sigma_0^2)\mu^2-2(\mu_0+\sigma_0^2x)\mu}{2\sigma_0^2}\right) \\ &\propto& \exp\left(-\frac{\mu^2-2\frac{\mu_0+x\sigma_0^2}{1+\sigma_0^2}\mu}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\ &\propto& \exp\left(-\frac{(\mu-(\mu_0+x\sigma_0^2)/(1+\sigma_0^2))^2}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\ &\propto& p\left(\mu \mid \frac{\mu_0+x\sigma_0^2}{1+\sigma_0^2}, \frac{\sigma_0}{\sqrt{1+\sigma_0^2}}\right). \end{array} $

Therefore, ${\mu_0, \sigma_0}$ parameterize a family of priors over ${\mu}$ that is conjugate to ${X \mid \mu}$.

Beta-Bernoulli

Let ${X \in \{0,1\}}$, ${\theta \in [0,1]}$, ${p(X=1 \mid \theta) = \theta}$, and ${p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}}$. The distribution over ${X}$ given $\theta$ is then called a Bernoulli distribution, and that of $\theta$ given ${\alpha}$ and ${\beta}$ is called a beta distribution. Note that ${p(X\mid \theta)}$ can also be written as ${\theta^X(1-\theta)^{1-X}}$. From this, we see that the family of beta distributions is a conjugate prior to the family of Bernoulli distributions, since

$\begin{array}{rcl} p(\theta \mid X=x, \alpha, \beta) &\propto& \theta^x(1-\theta)^{1-x} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} \\ &=& \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1} \\ &\propto& p(\theta \mid \alpha+x, \beta+(1-x)). \end{array} $

Gamma-Poisson

Let ${p(X=k \mid \lambda) = \frac{\lambda^k}{e^{\lambda}k!}}$ for ${k \in \mathbb{Z}_{\geq 0}}$. Let ${p(\lambda \mid \alpha, \beta) \propto \lambda^{\alpha-1}\exp(-\beta \lambda)}$. As noted before, the distribution for ${X}$ given ${\lambda}$ is called a Poisson distribution; the distribution for ${\lambda}$ given ${\alpha}$ and ${\beta}$ is called a gamma distribution. We can check that the family of gamma distributions is conjugate to the family of Poisson distributions. Important note: unlike in the last two examples, the normalization constant for the Poisson distribution actually depends on ${\lambda}$, and so we need to include it in our calculations:

$\begin{array}{rcl} p(\lambda \mid X=k, \alpha, \beta) &\propto& \frac{\lambda^k}{e^{\lambda}k!} \times \lambda^{\alpha-1}\exp(-\beta\lambda) \\ &\propto& \lambda^{\alpha+k-1}\exp(-(\beta+1)\lambda) \\ &\propto& p(\lambda \mid \alpha+k, \beta+1). \end{array} $
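One way to check this calculation numerically is to confirm that likelihood times prior, divided by the claimed posterior kernel, does not depend on ${\lambda}$. The sketch below does this for a few arbitrary values of ${\lambda}$; the helper names and the particular numbers are illustrative only.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k, including the lambda-dependent factor exp(-lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def gamma_kernel(lam, alpha, beta):
    """Unnormalized gamma density lam^(alpha-1) * exp(-beta * lam)."""
    return lam ** (alpha - 1) * math.exp(-beta * lam)

def gamma_poisson_update(alpha, beta, k):
    """Posterior gamma parameters after observing a Poisson count k."""
    return alpha + k, beta + 1

alpha, beta, k = 2.0, 3.0, 4
a1, b1 = gamma_poisson_update(alpha, beta, k)

# If the update is right, this ratio is the same constant (1/k!) for every lambda.
ratios = [
    poisson_pmf(k, lam) * gamma_kernel(lam, alpha, beta) / gamma_kernel(lam, a1, b1)
    for lam in (0.5, 1.0, 2.5)
]
print(ratios)
```

Dropping the `exp(-lam)` factor from `poisson_pmf` would make the ratio vary with ${\lambda}$, which is the point of the note about the normalization constant.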

Note that, in general, a family of distributions will always have some conjugate family, as if nothing else the family of all probability distributions over $\theta$ will be a conjugate family. What we really care about is a conjugate family that itself has nice properties, such as tractably computable moments.

Conjugate priors have a very nice relationship to exponential families, established in the following theorem:

Theorem 7 Let ${p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta))}$ be an exponential family. Then ${p(\theta \mid \eta, \kappa) \propto h_2(\theta)\exp\left(\eta^T\theta-\kappa A(\theta)\right)}$ is a conjugate prior for ${x \mid \theta}$ for any choice of ${h_2}$. The update formula is ${p(\theta \mid x, \eta, \kappa) = p(\theta \mid \eta+\phi(x), \kappa+1)}$. Furthermore, ${\theta \mid \eta, \kappa}$ is itself an exponential family, with sufficient statistics ${[\theta; A(\theta)]}$.

Checking the theorem is a matter of straightforward algebra, so I will leave the proof as an exercise to the reader. Note that, as before, there is no guarantee that ${p(\theta \mid \eta, \kappa)}$ will be tractable; however, in many cases the conjugate prior given by Theorem 7 is a well-behaved family. See this Wikipedia page for examples of conjugate priors, many of which correspond to exponential family distributions.

5. Maximum Entropy and Duality

The final property of exponential families I would like to establish is a certain duality property. What I mean by this is that exponential families can be thought of as the maximum entropy distributions subject to a constraint on the expected value of their sufficient statistics. For those unfamiliar with the term, the entropy of a distribution over ${X}$ with density ${p(X)}$ is ${\mathbb{E}[-\log p(X)] := -\int p(x)\log(p(x)) dx}$. Intuitively, higher entropy corresponds to higher uncertainty, so a maximum entropy distribution is one specifying as much uncertainty as possible given a certain set of information (such as the values of various moments). This makes them appealing, at least in theory, from a modeling perspective, since they “encode exactly as much information as is given and no more”. (Caveat: this intuition isn’t entirely valid, and in practice maximum-entropy distributions aren’t always necessarily appropriate.)

In any case, the duality property is captured in the following theorem:

Theorem 8 The distribution over ${X}$ with maximum entropy such that ${\mathbb{E}[\phi(X)] = T}$ lies in the exponential family with sufficient statistic ${\phi(X)}$ and ${h(X) = 1}$.

Proving this fully rigorously requires the calculus of variations; I will instead give the “physicist’s proof”. Proof: Let ${p(X)}$ be the density for ${X}$. Then we can view ${p}$ as the solution to the constrained maximization problem:

$\begin{array}{rcl} \mathrm{maximize} && -\int p(X) \log p(X) dX \\ \mathrm{subject \ to} && \int p(X) dX = 1 \\ && \int p(X) \phi(X) dX = T. \end{array} $

By the method of Lagrange multipliers, there exist ${\alpha}$ and ${\lambda}$ such that

$\frac{d}{dp}\left(-\int p(X)\log p(X) dX - \alpha [\int p(X) dX-1] - \lambda^T[\int \phi(X) p(X) dX-T]\right) = 0. $

This simplifies to:

$-\log p(X) - 1 - \alpha -\lambda^T \phi(X) = 0, $

which implies

$p(X) = \exp(-1-\alpha-\lambda^T\phi(X)) $

for some ${\alpha}$ and ${\lambda}$. In particular, if we let ${\lambda = -\theta}$ and ${\alpha = A(\theta)-1}$, then we recover the exponential family with ${h(X) = 1}$, as claimed. $\Box$
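For a concrete finite instance of Theorem 8, consider a six-sided die whose mean is constrained to be 4.5. The theorem says the maximum entropy distribution has the form ${p(x) \propto \exp(\theta x)}$, so it only remains to find $\theta$. The sketch below does this by bisection, relying on the fact that the mean is increasing in $\theta$; the die example, the bracketing interval, and the function names are my own illustrative choices.

```python
import math

def expfam_probs(theta, support):
    """Probabilities p(x) proportional to exp(theta * x) on a finite support."""
    m = max(theta * x for x in support)  # subtract the max exponent for stability
    weights = [math.exp(theta * x - m) for x in support]
    z = sum(weights)
    return [w / z for w in weights]

def mean(probs, support):
    return sum(p * x for p, x in zip(probs, support))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_with_mean(target, support, lo=-50.0, hi=50.0, iters=200):
    """Bisection on theta so that the exponential-family mean matches the target."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(expfam_probs(mid, support), support) < target:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2
    return theta, expfam_probs(theta, support)

support = [1, 2, 3, 4, 5, 6]
theta, probs = max_entropy_with_mean(4.5, support)
print(theta, probs, entropy(probs))

# Another distribution with the same mean, for comparison: it has lower entropy.
other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]
print(mean(other, support), entropy(other))
```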

6. Conclusion

Hopefully I have by now convinced you that exponential families have many nice properties: they have conjugate priors, simple-to-fit parameters, and easily-computed moments. While exponential families aren’t always appropriate models for a given situation, their tractability makes them the model of choice when no other information is present; and, since they can be obtained as maximum-entropy families, they are actually appropriate models in a wide family of circumstances.

Algebra trick of the day

2012-12-17T00:00:00-08:00

I’ve decided to start recording algebra tricks as I end up using them. Today I actually have two tricks, but they end up being used together a lot. I don’t know if they have more formal names, but I call them the “trace trick” and the “rank 1 relaxation”.

Suppose that we want to maximize the Rayleigh quotient $\frac{x^TAx}{x^Tx}$ of a matrix $A$. There are many reasons we might want to do this, for instance if $A$ is symmetric then the maximum corresponds to the largest eigenvalue. There are also many ways to do this, and the one that I’m about to describe is definitely not the most efficient, but it has the advantage of being flexible, in that it easily generalizes to constrained maximizations, etc.

The first observation is that $\frac{x^TAx}{x^Tx}$ is homogeneous, meaning that scaling $x$ doesn’t affect the result. So, we can assume without loss of generality that $x^Tx = 1$, and we end up with the optimization problem:

maximize $x^TAx$

subject to $x^Tx = 1$

This is where the trace trick comes in. Recall that the trace of a matrix is the sum of its diagonal entries. We are going to use two facts: first, the trace of a number is just the number itself. Second, trace(AB) = trace(BA). (Note, however, that trace(ABC) is not in general equal to trace(BAC), although trace(ABC) is equal to trace(CAB).) We use these two properties as follows — first, we re-write the optimization problem as:

maximize $Trace(x^TAx)$

subject to $Trace(x^Tx) = 1$

Second, we re-write it again using the invariance of trace under cyclic permutations:

maximize $Trace(Axx^T)$

subject to $Trace(xx^T) = 1$

Now we make the substitution $X = xx^T$:

maximize $Trace(AX)$

subject to $Trace(X) = 1, X = xx^T$

Finally, note that a matrix $X$ can be written as $xx^T$ if and only if $X$ is positive semi-definite and has rank 1. Therefore, we can further write this as

maximize $Trace(AX)$

subject to $Trace(X) = 1, Rank(X) = 1, X \succeq 0$

Aside from the rank 1 constraint, this would be a semidefinite program, a type of problem that can be solved efficiently. What happens if we drop the rank 1 constraint? Then I claim that the solution to this program would be the same as if I had kept the constraint in! Why is this? Let’s look at the eigendecomposition of $X$, written as $\sum_{i=1}^n \lambda_i x_ix_i^T$, with unit-norm eigenvectors $x_i$, $\lambda_i \geq 0$ (by positive semidefiniteness) and $\sum_{i=1}^n \lambda_i = 1$ (by the trace constraint). Let’s also look at $Trace(AX)$, which can be written as $\sum_{i=1}^n \lambda_i Trace(Ax_ix_i^T)$. Since $Trace(AX)$ is just a convex combination of the $Trace(Ax_ix_i^T)$, we might as well have just picked $X$ to be $x_ix_i^T$, where $i$ is chosen to maximize $Trace(Ax_ix_i^T)$. If we set that $\lambda_i$ to 1 and all the rest to 0, then we maintain all of the constraints without decreasing $Trace(AX)$, and we strictly increase it unless every term with $\lambda_i > 0$ already attains the maximum. So, ties aside, we couldn’t have been at the optimum value of $X$ unless only one $\lambda_i$ was non-zero. What we have shown, then, is that the rank of $X$ must be 1, so that the rank 1 constraint was unnecessary.

Technically, $X$ could be a linear combination of rank 1 matrices that all have the same value of $Trace(AX)$, but in that case we could just pick any one of those matrices. So what I have really shown is that at least one optimal point has rank 1, and we can recover such a point from any solution, even if the original solution was not rank 1.
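The argument above can be checked numerically on a small example. The sketch below uses a hypothetical symmetric $2 \times 2$ matrix $A$, verifies the trace trick $x^TAx = Trace(Axx^T)$ for one unit vector, and then samples random feasible $X$ (positive semidefinite, trace 1) to confirm that $Trace(AX)$ never exceeds the largest eigenvalue of $A$, which is the value attained by a rank 1 solution. It is plain Python with hand-rolled matrix helpers, purely for illustration.

```python
import math
import random

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def outer(x):
    return [[xi * xj for xj in x] for xi in x]

def quad_form(A, x):
    return sum(x[i] * A[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

A = [[2.0, 1.0], [1.0, 3.0]]   # hypothetical symmetric matrix
x = [0.6, 0.8]                 # a unit vector

# Trace trick: x^T A x equals Trace(A x x^T).
lhs = quad_form(A, x)
rhs = trace(matmul(A, outer(x)))

# Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
lam_max = (A[0][0] + A[1][1]) / 2 + math.sqrt(((A[0][0] - A[1][1]) / 2) ** 2 + A[0][1] ** 2)

def random_feasible_X(rng):
    """Random PSD 2x2 matrix with trace 1: w * u u^T + (1 - w) * v v^T with u orthogonal to v."""
    t = rng.uniform(0, math.pi)
    w = rng.random()
    U = outer([math.cos(t), math.sin(t)])
    V = outer([-math.sin(t), math.cos(t)])
    return [[w * U[i][j] + (1 - w) * V[i][j] for j in range(2)] for i in range(2)]

rng = random.Random(0)
best = max(trace(matmul(A, random_feasible_X(rng))) for _ in range(2000))
print(lhs, rhs, best, lam_max)
```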

Here is a problem that uses a similar trick. Suppose we want to find $x$ that simultaneously satisfies the equations:

$b_i = |a_i^Tx|^2$

for each $i = 1,\ldots,n$ (this example was inspired by the recent NIPS paper by Ohlsson, Yang, Dong, and Sastry, although the idea itself goes at least back to Candes, Strohmer, and Voroninski). Note that this is basically equivalent to solving a system of linear equations where we only know each equation up to a sign (or a phase, in the complex case). Therefore, in general, this problem will not have a unique solution. To ensure the solution is unique, let us assume the very strong condition that whenever $a_i^TVa_i = 0$ for all $i = 1,\ldots,n$, the matrix $V$ must itself be zero (note: Candes et al. get away with a much weaker condition). Given this, can we phrase the problem as a semidefinite program? I highly recommend trying to solve this problem on your own, or at least reducing it to a rank-constrained SDP, so I’ll include the solution below a fold.

Solution. We can, as before, re-write the equations as:

$b_i = Trace(a_ia_i^Txx^T)$

and further write this as

$b_i = Trace(a_ia_i^TX), X \succeq 0, rank(X) = 1$

As before, drop the rank 1 constraint and let $X = \sum_{j=1}^m \lambda_j x_jx_j^T$. Then we get:

$b_i = \sum_{j=1}^m Trace(a_ia_i^Tx_jx_j^T)\lambda_j$,

which we can re-write as $b_i = a_i^T\left(\sum_{j=1}^m \lambda_jx_jx_j^T\right)a_i$. But if $x^*$ is the true solution, then we also know that $b_i = a_i^Tx^*(x^*)^Ta_i$, so that $a_i^T\left(-x^*(x^*)^T+\sum_{j=1}^m \lambda_jx_jx_j^T\right)a_i = 0$ for all $i$. By the non-degeneracy assumption, this implies that

$x^*(x^*)^T = \sum_{j=1}^m \lambda_jx_jx_j^T$,

so in particular $X = x^*(x^*)^T$. Therefore, $X = x^*(x^*)^T$ is the only solution to the semidefinite program even after dropping the rank constraint.

Log-Linear Models

2012-12-06T00:00:00-08:00

I’ve spent most of my research career trying to build big, complex nonparametric models; however, I’ve more recently delved into the realm of natural language processing, where how awesome your model looks on paper is irrelevant compared to how well it models your data. In the spirit of this new work (and to lay the groundwork for a later post on NLP), I’d like to go over a family of models that I think is often overlooked due to not being terribly sexy (or at least, I overlooked it for a good while). This family is the family of log-linear models, which are models of the form:

$p(x \mid \theta) \propto e^{\phi(x)^T\theta}, $

where ${\phi}$ maps a data point to a feature vector; they are called log-linear because the log of the probability is a linear function of ${\phi(x)}$. We refer to ${\phi(x)^T\theta}$ as the score of ${x}$.

This model class might look fairly restricted at first, but the real magic comes in with the feature vector ${\phi}$. In fact, every probabilistic model that is absolutely continuous with respect to Lebesgue measure can be represented as a log-linear model for suitable choices of ${\phi}$ and $\theta$. This is actually trivially true, as we can just take ${\phi : X \rightarrow \mathbb{R}}$ to be ${\log p(x)}$ and $\theta$ to be ${1}$.

You might object to this choice of ${\phi}$, since it maps into ${\mathbb{R}}$ rather than ${\{0,1\}^n}$, and feature vectors are typically discrete. However, we can do just as well by letting ${\phi : X \rightarrow \{0,1\}^{\infty}}$, where the ${i}$th coordinate of ${\phi(x)}$ is the ${i}$th digit in the binary representation of ${\log p(x)}$, then let $\theta$ be the vector ${\left(\frac{1}{2},\frac{1}{4},\frac{1}{8},\ldots\right)}$.

It is important to distinguish between the ability to represent an arbitrary model as log-linear and the ability to represent an arbitrary family of models as a log-linear family (that is, as the set of models we get if we fix a choice of features ${\phi}$ and then vary $\theta$). When we don’t know the correct model in advance and want to learn it, this latter consideration can be crucial. Below, I give two examples of model families and discuss how they fit (or do not fit) into the log-linear framework. Important caveat: in both of the models below, it is typically the case that at least some of the variables involved are unobserved. However, we will ignore this for now, and assume that, at least at training time, all of the variables are fully observed (in other words, we can see ${x_i}$ and ${y_i}$ in the hidden Markov model and we can see the full tree of productions in the probabilistic context free grammar).

Hidden Markov Models. A hidden Markov model, or HMM, is a model with latent (unobserved) variables ${x_1,\ldots,x_n}$ together with observed variables ${y_1,\ldots,y_n}$. The distribution for ${y_i}$ depends only on ${x_i}$, and the distribution for ${x_i}$ depends only on ${x_{i-1}}$ (in the sense that ${x_i}$ is conditionally independent of ${x_1,\ldots,x_{i-2}}$ given ${x_{i-1}}$). We can thus summarize the information in an HMM with the distributions ${p(x_{i} = t \mid x_{i-1} = s)}$ and ${p(y_i = u \mid x_i = s)}$.

We can express a hidden Markov model as a log-linear model by defining two classes of features: (i) features ${\phi_{s,t}}$ that count the number of ${i}$ such that ${x_{i-1} = s}$ and ${x_i = t}$; and (ii) features ${\psi_{s,u}}$ that count the number of ${i}$ such that ${x_i = s}$ and ${y_i = u}$. While this choice of features yields a model family capable of expressing an arbitrary hidden Markov model, it is also capable of learning models that are not hidden Markov models. In particular, we would like to think of ${\theta_{s,t}}$ (the index of $\theta$ corresponding to ${\phi_{s,t}}$) as ${\log p(x_i=t \mid x_{i-1}=s)}$, but there is no constraint that ${\sum_{t} \exp(\theta_{s,t}) = 1}$ for each ${s}$, whereas we do necessarily have ${\sum_{t} p(x_i = t \mid x_{i-1}=s) = 1}$ for each ${s}$. If ${n}$ is fixed, we still do obtain an HMM for any setting of $\theta$, although ${\theta_{s,t}}$ will have no simple relationship with ${\log p(x_i = t \mid x_{i-1} = s)}$. Furthermore, the relationship depends on ${n}$, and will therefore not work if we care about multiple Markov chains with different lengths.
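To make the two feature classes concrete, the sketch below counts transitions and emissions for a toy sequence and checks that, when $\theta$ is set to the log transition and emission probabilities, the exponentiated score equals the HMM probability of the sequence. The two-state model and its probabilities are invented for illustration, and the probability is conditioned on the first state, since the features above do not include an initial-state term.

```python
import math
from collections import Counter

def hmm_features(states, observations):
    """Count features: phi_{s,t} for transitions, psi_{s,u} for emissions."""
    trans = Counter(zip(states[:-1], states[1:]))
    emit = Counter(zip(states, observations))
    return trans, emit

def score(trans, emit, theta_trans, theta_emit):
    """Log-linear score: the dot product of feature counts with theta."""
    return (sum(c * theta_trans[k] for k, c in trans.items())
            + sum(c * theta_emit[k] for k, c in emit.items()))

# Hypothetical two-state HMM (states H, C; observations "1", "2").
p_trans = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
p_emit = {("H", "1"): 0.2, ("H", "2"): 0.8, ("C", "1"): 0.9, ("C", "2"): 0.1}
theta_trans = {k: math.log(v) for k, v in p_trans.items()}
theta_emit = {k: math.log(v) for k, v in p_emit.items()}

states = ["H", "H", "C", "C"]
obs = ["2", "2", "1", "2"]
trans, emit = hmm_features(states, obs)
s = score(trans, emit, theta_trans, theta_emit)

# Direct computation of p(x_2..x_n, y_1..y_n | x_1) for comparison.
direct = 1.0
for i in range(len(states)):
    if i > 0:
        direct *= p_trans[(states[i - 1], states[i])]
    direct *= p_emit[(states[i], obs[i])]

print(math.exp(s), direct)
```

Nothing in `score` forces the rows of `theta_trans` to exponentiate to a distribution, which is exactly the extra freedom discussed above.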

Is the ability to express models that are not HMMs good or bad? It depends. If we know for certain that our data satisfy the HMM assumption, then expanding our model family to include models that violate that assumption can only end up hurting us. If the data do not satisfy the HMM assumption, then increasing the size of the model family may allow us to overcome what would otherwise be a model mis-specification. I personally would prefer to have as much control as possible over what assumptions I make, so I tend to see the over-expressivity of the log-linear representation of HMMs as a bug rather than a feature.

Probabilistic Context Free Grammars. A probabilistic context free grammar, or PCFG, is simply a context free grammar where we place a probability distribution over the production rules for each non-terminal. For those unfamiliar with context free grammars, a context free grammar is specified by:

A set ${\mathcal{S}}$ of non-terminal symbols, including a distinguished initial symbol ${E}$.
A set ${\mathcal{T}}$ of terminal symbols.
For each ${s \in \mathcal{S}}$, one or more production rules of the form ${s \mapsto w_1w_2\cdots w_k}$, where ${k \geq 0}$ and ${w_i \in \mathcal{S} \cup \mathcal{T}}$.

For instance, a context free grammar for arithmetic expressions might have ${\mathcal{S} = \{E\}}$, ${\mathcal{T} = \{+,-,\times,/,(,)\} \cup \mathbb{R}}$, and the following production rules:

${E \mapsto x}$ for all ${x \in \mathbb{R}}$
${E \mapsto E + E}$
${E \mapsto E - E}$
${E \mapsto E \times E}$
${E \mapsto E / E}$
${E \mapsto (E)}$

The language corresponding to a context free grammar is the set of all strings that can be obtained by starting from ${E}$ and applying production rules until we only have terminal symbols. The language corresponding to the above grammar is, in fact, the set of well-formed arithmetic expressions, such as ${5-4-2}$, ${2-3\times (4.3)}$, and ${5/9927.12/(3-3\times 1)}$.

As mentioned above, a probabilistic context free grammar simply places a distribution over the production rules for any given non-terminal symbol. By repeatedly sampling from these distributions until we are left with only terminal symbols, we obtain a probability distribution over the language of the grammar.

We can represent a PCFG as a log-linear model by using a feature ${\phi_r}$ for each production rule ${r}$. For instance, we have a feature that counts the number of times that the rule ${E \mapsto E + E}$ gets applied, and another feature that counts the number of times that ${E \mapsto (E)}$ gets applied. Such features yield a log-linear model family that contains all probabilistic context free grammars for a given (deterministic) context free grammar. However, it also contains additional models that do not correspond to PCFGs; this is because we run into the same problem as for HMMs, which is that the sum of ${\exp(\theta_r)}$ over production rules of a given non-terminal does not necessarily add up to ${1}$. In fact, the problem is even worse here. For instance, suppose that ${\theta_{E \mapsto E + E} = 0.1}$ in the model above. Then the expression ${E+E+E+E+E+E}$ gets a score of ${0.5}$, and longer chains of ${E}$s get even higher scores. In particular, there is an infinite sequence of expressions with increasing scores and therefore the model doesn’t normalize (since the sum of the exponentiated scores of all possible productions is infinite).
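The normalization problem is easy to see by computing scores directly. In the sketch below, a derivation is summarized by how many times each production rule was applied, and only the rule ${E \mapsto E + E}$ has a non-zero weight; the rule-name strings and the choice to give every other rule weight 0 are assumptions made for illustration.

```python
from collections import Counter

def score(rule_counts, theta):
    """Log-linear score: sum over rules of (times applied) * (weight of rule)."""
    return sum(count * theta.get(rule, 0.0) for rule, count in rule_counts.items())

def chain_rule_counts(num_operands):
    """Rule counts for E + E + ... + E with the given number of E's."""
    return Counter({"E -> E + E": num_operands - 1})

theta = {"E -> E + E": 0.1}

# E+E+E+E+E+E uses the rule five times, so its score is 0.5.
print(score(chain_rule_counts(6), theta))

# Scores keep growing with the length of the chain, so exp(score) never decays.
scores = [score(chain_rule_counts(n), theta) for n in range(1, 11)]
print(scores)
```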

So, log-linear models over-represent PCFGs in the same way as they over-represent HMMs, but the problems are even worse than before. Let’s ignore these issues for now, and suppose that we want to learn PCFGs with an unknown underlying CFG. To be a bit more concrete, suppose that we have a large collection of possible production rules for each non-terminal ${s \in \mathcal{S}}$, and we think that a small but unknown subset of those production rules should actually appear in the grammar. Then there is no way to encode this directly within the context of a log-linear model family, although we can encode such “sparsity constraints” using simple extensions to log-linear models (for instance, by adding a penalty for the number of non-zero entries in $\theta$). So, we have found another way in which the log-linear representation is not entirely adequate.

Conclusion. Based on the examples above, we have seen that log-linear models have difficulty placing constraints on latent variables. This showed up in two different ways: first, we are unable to constrain subsets of variables to add up to ${1}$ (what I call “local normalization” constraints); second, we are unable to encode sparsity constraints within the model. In both of these cases, it is possible to extend the log-linear framework to address these sorts of constraints, although that is outside the scope of this post.

Parameter Estimation for Log-Linear Models

I’ve explained what a log-linear model is, and partially characterized its representational power. I will now answer the practical question of how to estimate the parameters of a log-linear model (i.e., how to fit $\theta$ based on observed data). Recall that a log-linear model places a distribution over a space ${X}$ by choosing ${\phi : X \rightarrow \mathbb{R}^n}$ and ${\theta \in \mathbb{R}^n}$ and defining

$p(x \mid \theta) \propto \exp(\phi(x)^T\theta)$

More precisely (assuming ${X}$ is a discrete space), we have

$p(x \mid \theta) = \frac{\exp(\phi(x)^T\theta)}{\sum_{x' \in X} \exp(\phi(x')^T\theta)}$
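For a small discrete space this normalized form can be computed directly. The following sketch uses a made-up three-element space with two-dimensional features; subtracting the maximum score before exponentiating is a standard numerical precaution and does not change the result.

```python
import math

def log_linear_probs(features, theta):
    """Return p(x | theta) for every x, given a dict mapping x to its feature vector phi(x)."""
    scores = {x: sum(f * t for f, t in zip(phi, theta)) for x, phi in features.items()}
    m = max(scores.values())  # for numerical stability only
    weights = {x: math.exp(s - m) for x, s in scores.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

# Hypothetical feature map on the space {a, b, c}.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
theta = [0.5, -1.0]
probs = log_linear_probs(features, theta)
print(probs)

# The log-odds between two points is the difference of their scores.
print(math.log(probs["a"] / probs["b"]))
```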

Given observations ${x_1,\ldots,x_n}$, which we assume to be independent given $\theta$, our goal is to choose $\theta$ maximizing ${p(x_1,\ldots,x_n \mid \theta)}$, or, equivalently, ${\log p(x_1,\ldots,x_n \mid \theta)}$. In equations, we want

$\theta^* = \arg\max\limits_{\theta} \sum_{i=1}^n \left[\phi(x_i)^T\theta - \log\left(\sum_{x \in X} \exp(\phi(x)^T\theta)\right) \right]. \ \ \ \ \ (1)$

We typically use gradient methods (such as gradient ascent, stochastic gradient ascent, or L-BFGS) to maximize the right-hand side of (1). If we compute the gradient of (1) then we get:

$\sum_{i=1}^n \left(\phi(x_i)-\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)}{\sum_{x \in X} \exp(\phi(x)^T\theta)}\right). \ \ \ \ \ (2)$

We can re-write (2) in the following more compact form:

$\sum_{i=1}^n \left(\phi(x_i) - \mathbb{E}[\phi(x) \mid \theta]\right). \ \ \ \ \ (3)$

In other words, the contribution of each training example ${x_i}$ to the gradient is the extent to which the feature values for ${x_i}$ exceed their expected values conditioned on $\theta$.
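The "observed minus expected features" form of the gradient translates almost directly into code. The sketch below fits $\theta$ by plain gradient ascent on the average log-likelihood for a toy three-element space; the data, the feature map, the step size, and the iteration count are all illustrative assumptions rather than recommended settings.

```python
import math

def model_probs(features, theta):
    """p(x | theta) for every x in a small discrete space."""
    scores = {x: sum(f * t for f, t in zip(phi, theta)) for x, phi in features.items()}
    m = max(scores.values())
    weights = {x: math.exp(s - m) for x, s in scores.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

def expected_features(features, theta):
    """E[phi(x) | theta] under the current model."""
    probs = model_probs(features, theta)
    dim = len(theta)
    return [sum(probs[x] * features[x][j] for x in features) for j in range(dim)]

def gradient(features, data, theta):
    """Sum over the data of phi(x_i) - E[phi(x) | theta]."""
    exp_phi = expected_features(features, theta)
    return [sum(features[x][j] for x in data) - len(data) * exp_phi[j]
            for j in range(len(theta))]

def fit(features, data, dim, step=1.0, iters=500):
    """Gradient ascent on the average log-likelihood."""
    theta = [0.0] * dim
    for _ in range(iters):
        g = gradient(features, data, theta)
        theta = [t + step * gj / len(data) for t, gj in zip(theta, g)]
    return theta

# Hypothetical indicator features on {a, b, c}; c is the reference outcome.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
data = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
theta_hat = fit(features, data, dim=2)
fitted = model_probs(features, theta_hat)
print(theta_hat, fitted)
```

At the fitted $\theta$ the gradient is essentially zero, so the model's expected features match the empirical feature averages (here, the empirical frequencies of `a` and `b`).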

One important consideration for such gradient-based numerical optimizers is convexity. If the objective function we are trying to minimize is convex (or the one we are trying to maximize is concave), then gradient methods are guaranteed to converge to the global optimum. If the objective function is non-convex, then a gradient-based approach (or any other type of local search) may converge to a local optimum that is very far from the global optimum. In order to assess convexity, we compute the Hessian (matrix of second derivatives) and check whether it is positive semidefinite. (In this case, we actually care about concavity, so we want the Hessian to be negative semidefinite.) We can compute the Hessian by differentiating (2), which gives us

$n \times \left[\left(\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)}{\sum_{x \in X} \exp(\phi(x)^T\theta)}\right)\left(\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)}{\sum_{x \in X} \exp(\phi(x)^T\theta)}\right)^T-\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)\phi(x)^T}{\sum_{x \in X} \exp(\phi(x)^T \theta)}\right]. \ \ \ \ \ (4)$

Again, we can re-write this more compactly as

$n\times \left(\mathbb{E}[\phi(x) \mid \theta]\mathbb{E}[\phi(x) \mid \theta]^T - \mathbb{E}[\phi(x)\phi(x)^T \mid \theta]\right). \ \ \ \ \ (5)$

The term inside the parentheses of (5) is exactly the negative of the covariance matrix of ${\phi(x)}$ given $\theta$, and is therefore necessarily negative semidefinite, so the objective function we are trying to maximize is indeed concave, which, as noted before, implies that our gradient methods will always reach the global optimum.

Regularization and Concavity

We may in practice wish to encode additional prior knowledge about $\theta$ in our model, especially if the dimensionality of $\theta$ is large relative to the amount of data we have. Can we do this and still maintain concavity? The answer in many cases is yes: since the ${L^p}$-norm is convex for all ${p \geq 1}$, we can subtract an ${L^p}$ penalty from the objective for any such ${p}$ and still have a concave objective function.

Conclusion

Log-linear models provide a universal representation for individual probability distributions, but not for arbitrary families of probability distributions (for instance, due to the inability to capture local normalization constraints or sparsity constraints). However, for the families they do express, parameter optimization can be performed efficiently due to a likelihood function that is log-concave in its parameters. Log-linear models also have tie-ins to many other beautiful areas of statistics, such as exponential families, which will be the subject of the next post.

Beyond Bayesians and Frequentists

2012-10-31T00:00:00-07:00

(This is available in pdf form here.)

If you are a newly initiated student into the field of machine learning, it won’t be long before you start hearing the words “Bayesian” and “frequentist” thrown around. Many people around you probably have strong opinions on which is the “right” way to do statistics, and within a year you’ve probably developed your own strong opinions (which are suspiciously similar to those of the people around you, despite there being a much greater variance of opinion between different labs). In fact, now that the year is 2012 the majority of new graduate students are being raised as Bayesians (at least in the U.S.) with frequentists thought of as stodgy emeritus professors stuck in their ways.

If you are like me, the preceding set of facts will make you very uneasy. They will make you uneasy because simple pattern-matching – the strength of people’s opinions, the reliability with which these opinions split along age boundaries and lab boundaries, and the ridicule that each side levels at the other camp – makes the “Bayesians vs. frequentists” debate look far more like politics than like scholarly discourse. Of course, that alone does not necessarily prove anything; these disconcerting similarities could just be coincidences that I happened to cherry-pick.

My next point, then, is that we are right to be uneasy, because such debate makes us less likely to evaluate the strengths and weaknesses of both approaches in good faith. This essay is a push against that — I summarize the justifications for Bayesian methods and where they fall short, show how frequentist approaches can fill in some of their shortcomings, and then present my personal (though probably woefully under-informed) guidelines for choosing which type of approach to use.

Before doing any of this, though, a bit of background is in order…

1. Background on Bayesians and Frequentists

1.1. Three Levels of Argument

As Andrew Critch [6] insightfully points out, the Bayesians vs. frequentists debate is really three debates at once, centering around one or more of the following arguments:

Whether to interpret subjective beliefs as probabilities
Whether to interpret probabilities as subjective beliefs (as opposed to asymptotic frequencies)
Whether a Bayesian or frequentist algorithm is better suited to solving a particular problem.

Given my own research interests, I will add a fourth argument:

Whether Bayesian or frequentist techniques are better suited to engineering an artificial intelligence.

Andrew Gelman [9] has his own well-written essay on the subject, where he expands on these distinctions and presents his own more nuanced view.

Why are these arguments so commonly conflated? I’m not entirely sure; I would guess it is for historical reasons but I have so far been unable to find said historical reasons. Whatever the reasons, what this boils down to in the present day is that people often form opinions on 1. and 2., which then influence their answers to 3. and 4. This is not good, since 1. and 2. are philosophical in nature and difficult to resolve correctly, whereas 3. and 4. are often much easier to resolve and extremely important to resolve correctly in practice. Let me re-iterate: the Bayes vs. frequentist discussion should center on the practical employment of the two methods, or, if epistemology must be discussed, it should be clearly separated from the day-to-day practical decisions. Aside from the difficulties with correctly deciding epistemology, the relationship between generic epistemology and specific practices in cutting-edge statistical research is only via a long causal chain, and it should be completely unsurprising if Bayesian epistemology leads to the employment of frequentist tools or vice versa.

For this reason and for reasons of space, I will spend the remainder of the essay focusing on statistical algorithms rather than on interpretations of probability. For those who really want to discuss interpretations of probability, I will address that in a later essay.

1.2. Recap of Bayesian Decision Theory

(What follows will be review for many.) In Bayesian decision theory, we assume that there is some underlying world state $\theta$ and a likelihood function ${p(X_1, \ldots, X_n \mid \theta)}$ over possible observations. (A likelihood function is just a conditional probability distribution where the parameter conditioned on can vary.) We also have a space ${A}$ of possible actions and a utility function ${U(\theta; a)}$ that gives the utility of performing action ${a}$ if the underlying world state is $\theta$. We can incorporate notions like planning and value of information by defining ${U(\theta; a)}$ recursively in terms of an identical agent to ourselves who has seen one additional observation (or, if we are planning against an adversary, in terms of the adversary). For a more detailed overview of this material, see the tutorial by North [11].

What distinguishes the Bayesian approach in particular is one additional assumption, a prior distribution ${p(\theta)}$ over possible world states. To make a decision with respect to a given prior, we compute the posterior distribution ${p_{\mathrm{posterior}}(\theta \mid X_1, \ldots, X_n)}$ using Bayes’ theorem, then take the action ${a}$ that maximizes ${\mathbb{E}_{p_{\mathrm{posterior}}}[U(\theta; a)]}$.

In practice, ${p_{\mathrm{posterior}}}$ can be quite difficult to compute, and so we often attempt to approximate it. Such attempts are known as approximate inference algorithms.

1.3. Steel-manning Frequentists

There are many different ideas that fall under the broad umbrella of frequentist techniques. While it would be impossible to adequately summarize all of them even if I attempted to, there are three in particular that I would like to describe, and which I will call frequentist decision theory, frequentist guarantees, and frequentist analysis tools.

Frequentist decision theory has a very similar setup to Bayesian decision theory, with a few key differences. These are discussed in detail and contrasted with Bayesian decision theory in [10], although we summarize the differences here. There is still a likelihood function ${p(X_1, \ldots, X_n | \theta)}$ and a utility function ${U(\theta; a)}$. However, we do not assume the existence of a prior on $\theta$, and instead choose the decision rule ${a(X_1, \ldots, X_n)}$ that maximizes

$\displaystyle \min\limits_{\theta} \mathbb{E}[U(\theta; a(X_1,\ldots,X_n)) \mid \theta]. \ \ \ \ \ (1)$

In other words, we ask for a worst case guarantee rather than an average case guarantee. As an example of how these would differ, imagine a scenario where we have no data to observe, an unknown ${\theta \in \{1,\ldots,N\}}$, and we choose an action ${a \in \{0,\ldots,N\}}$. Furthermore, ${U(\theta; 0) = 0}$ for all $\theta$, ${U(\theta; a) = -1}$ if ${a = \theta}$, and ${U(\theta; a) = 1}$ if ${a \neq 0}$ and ${a \neq \theta}$. Then a frequentist will always choose ${a = 0}$ because any other action gets ${-1}$ utility in the worst case; a Bayesian with a uniform prior over $\theta$, on the other hand, will happily choose any non-zero value of ${a}$ since such an action gains ${\frac{N-2}{N}}$ utility in expectation. (I am purposely ignoring more complex ideas like mixed strategies for the purpose of illustration.)
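The toy example above is small enough to enumerate. The sketch below computes the worst-case utility and the expected utility of every action for ${N = 10}$; the uniform prior is an assumption on my part (it is the prior under which the ${\frac{N-2}{N}}$ figure comes out), and ties are broken by taking the first maximizer.

```python
from fractions import Fraction

def utility(a, theta):
    """Utility of action a when the world state is theta, as in the example."""
    if a == 0:
        return 0
    return -1 if a == theta else 1

def worst_case(a, N):
    """Minimum utility over all world states (the frequentist criterion)."""
    return min(utility(a, theta) for theta in range(1, N + 1))

def expected_uniform(a, N):
    """Expected utility under a uniform prior on theta (the Bayesian criterion)."""
    return Fraction(sum(utility(a, theta) for theta in range(1, N + 1)), N)

N = 10
actions = range(N + 1)
minimax_action = max(actions, key=lambda a: worst_case(a, N))
bayes_action = max(actions, key=lambda a: expected_uniform(a, N))

print(minimax_action, worst_case(minimax_action, N))
print(bayes_action, expected_uniform(bayes_action, N))
```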

Note that the frequentist optimization problem is more complicated than in the Bayesian case, since the value of (1) depends on the joint behavior of ${a(X_1,\ldots,X_n)}$, whereas with Bayes we can optimize ${a(X_1,\ldots,X_n)}$ for each set of observations separately.

As a result of this more complex optimization problem, it is often not actually possible to maximize (1), so many frequentist techniques instead develop tools to lower-bound (1) for a given decision procedure, and then try to construct a decision procedure that is reasonably close to the optimum. Support vector machines [2], which try to pick separating hyperplanes that minimize generalization error, are one example of this where the algorithm is explicitly trying to maximize worst-case utility. Another example of a frequentist decision procedure is L1-regularized least squares for sparse recovery [3], where the procedure itself does not look like it is explicitly maximizing any utility function, but a separate analysis shows that it is close to the optimal procedure anyways.

The second sort of frequentist approach to statistics is what I call a frequentist guarantee. A frequentist guarantee on an algorithm is a guarantee that, with high probability with respect to how the data was generated, the output of the algorithm will satisfy a given property. The most familiar example of this is any algorithm that generates a frequentist confidence interval: to generate a 95% frequentist confidence interval for a parameter $\theta$ is to run an algorithm that outputs an interval, such that with probability at least 95% $\theta$ lies within the interval. An important fact about most such algorithms is that the size of the interval only grows logarithmically with the amount of confidence we require, so getting a 99.9999% confidence interval is only slightly harder than getting a 95% confidence interval (and we should probably be asking for the former whenever possible).
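To make the claim about cheap extra confidence concrete, here is a small sketch that assumes one particular interval: the one given by Hoeffding's inequality for the mean of $n$ independent samples bounded in $[0,1]$. That choice, and the helper name `hoeffding_half_width`, are assumptions for illustration; other intervals will have different constants.

```python
import math

def hoeffding_half_width(n, delta):
    """Half-width t with P(|sample mean - true mean| >= t) <= delta,
    from Hoeffding's bound 2 * exp(-2 * n * t**2) for samples in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

n = 1000
w95 = hoeffding_half_width(n, 0.05)   # 95% confidence
w6 = hoeffding_half_width(n, 1e-6)    # 99.9999% confidence
print(w95, w6, w6 / w95)
```

For this particular interval the width grows like the square root of $\log(1/\delta)$, so going from 95% to 99.9999% confidence roughly doubles the interval regardless of $n$.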

If we use such algorithms to test hypotheses or to test discrete properties of $\theta$, then we can obtain algorithms that take in probabilistically generated data and produce an output that with high probability depends only on how the data was generated, not on the specific random samples that were given. For instance, we can create an algorithm that takes in samples from two distributions, and is guaranteed (with high probability) to output 1 whenever they are the same, 0 whenever they differ by at least ${\epsilon}$ in total variation distance, and could have arbitrary output if they are different but the total variation distance is less than ${\epsilon}$. This is an amazing property: it takes in random input and produces an essentially deterministic answer.

Finally, a third type of frequentist approach seeks to construct analysis tools for understanding the behavior of random variables. Metric entropy, the Chernoff and Azuma-Hoeffding bounds [12], and Doob’s optional stopping theorem are representative examples of this sort of approach. Arguably, everyone with the time to spare should master these techniques, since being able to analyze random variables is important no matter what approach to statistics you take. Indeed, frequentist analysis tools have no conflict at all with Bayesian methods — they simply provide techniques for understanding the behavior of the Bayesian model.

2. Bayes vs. Other Methods

2.1. Justification for Bayes

We presented Bayesian decision theory above, but are there any reasons why we should actually use it? One commonly-given reason is that Bayesian statistics is merely the application of Bayes’ Theorem, which, being a theorem, describes the only correct way to update beliefs in response to new evidence; anything else can only be justified to the extent that it provides a good approximation to Bayesian updating. This may be true, but Bayes’ Theorem only applies if we already have a prior, and if we accept probability as the correct framework for expressing uncertain beliefs. We might want to avoid one or both of these assumptions. Bayes’ theorem also doesn’t explain why we care about expected utility as opposed to some other statistic of the distribution over utilities (although note that frequentist decision theory also tries to maximize expected utility).

One compelling answer to this is Cox’s Theorem, which shows that any agent must implicitly be using a probability model to make decisions, or else they can be Dutch-booked, meaning there is a series of bets that they would be willing to make that causes them to lose money with certainty. Another answer is the complete class theorem, which shows that any non-Bayesian decision procedure is strictly dominated by a Bayesian decision procedure, meaning that the Bayesian procedure performs at least as well as the non-Bayesian procedure in all cases with certainty. In other words, if you are doing anything non-Bayesian, then either it is secretly a Bayesian procedure or there is another procedure that does strictly better than it. Finally, the VNM Utility Theorem states that any agent with consistent preferences over distributions of outcomes must be implicitly maximizing the expected value of some scalar-valued function, which we can then use as our choice of utility function ${U}$. These theorems, however, ignore the issue of computation: while the best decision procedure may be Bayesian, the best computationally-efficient decision procedure could easily be non-Bayesian.

Another justification for Bayes is that, in contrast to ad hoc frequentist techniques, it actually provides a general theory for constructing statistical algorithms, as well as for incorporating side information such as expert knowledge. Indeed, when trying to model complex and highly structured situations it is difficult to obtain any sort of frequentist guarantees (although analysis tools can still often be applied to gain intuition about parts of the model). A prior lets us write down the sorts of models that would allow us to capture structured situations (for instance, when trying to do language modeling or transfer learning). Non-Bayesian methods exist for these situations, but they are often ad hoc and in many cases end up looking like an approximation to Bayes. One example of this is Kneser-Ney smoothing for n-gram models, an ad hoc algorithm that ended up being very similar to an approximate inference algorithm for the hierarchical Pitman-Yor process [15, 14, 17, 8]. This raises another important point against Bayes, which is that the proper Bayesian interpretation may be very mathematically complex. Pitman-Yor processes are on the cutting-edge of Bayesian nonparametric statistics, which is itself one of the more technical subfields of statistical machine learning, so it was probably much easier to come up with Kneser-Ney smoothing than to find the interpretation in terms of Pitman-Yor processes.

2.2. When the Justifications Fail

The first and most common objection to Bayes is that a Bayesian method is only as good as its prior. While for simple models the performance of Bayes is relatively independent of the prior, such models can only capture data where frequentist techniques would also perform very well. For more complex (especially nonparametric) Bayesian models, the performance can depend strongly on the prior, and designing good priors is still an open problem. As one example I point to my own research on hierarchical nonparametric models, where the most straightforward attempts to build a hierarchical model lead to severe pathologies [13].

Even if a Bayesian model does have a good prior, it may be computationally intractable to perform posterior inference. For instance, structure learning in Bayesian networks is NP-hard [4], as is topic inference in the popular latent Dirichlet allocation model (and this continues to hold even if we only want to perform approximate inference). Similar stories probably hold for other common models, although a theoretical survey has yet to be made; suffice to say that in practice approximate inference remains a difficult and unsolved problem, with many models not even considered because of the apparent hopelessness of performing inference in them.

Because frequentist methods often come with an analysis of the specific algorithm being employed, they can sometimes overcome these computational issues. One example of this mentioned already is L1 regularized least squares [3]. The problem setup is that we have a linear regression task ${Ax = b+v}$ where ${A}$ and ${b}$ are known, ${v}$ is a noise vector, and ${x}$ is believed to be sparse (typically ${x}$ has many more rows than ${b}$, so without the sparsity assumption ${x}$ would be underdetermined). Let us suppose that ${x}$ has ${n}$ rows and ${k}$ non-zero rows. Then the number of possible sparsity patterns is ${\binom{n}{k}}$, which is large enough that a brute force consideration of all possible sparsity patterns is intractable. However, we can show that solving a certain semidefinite program will with high probability yield the appropriate sparsity pattern, after which recovering ${x}$ reduces to a simple least squares problem. (A semidefinite program is a certain type of optimization problem that can be solved efficiently [16].)

Finally, Bayes has no good way of dealing with adversaries or with cases where the data was generated in a complicated way that could make it highly biased (for instance, as the output of an optimization procedure). A toy example of an adversary would be playing rock-paper-scissors: how should a Bayesian play such a game? The straightforward answer is to build up a model of the opponent based on their plays so far, and then to make the play that maximizes the expected score (probability of winning minus probability of losing). However, such a strategy fares poorly against any opponent with access to the model being used, as they can then just run the model themselves to predict the Bayesian’s plays in advance, thereby winning every single time. In contrast, there is a frequentist strategy called the multiplicative weights update method that fares well against an arbitrary opponent (even one with superior computational resources and access to our agent’s source code). The multiplicative weights method does far more than win at rock-paper-scissors: it is also a key component of the fastest algorithms for solving many important optimization problems (including network flow), and it forms the theoretical basis for the widely used AdaBoost algorithm [1, 5, 7].
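As a rough illustration of the rock-paper-scissors claim, here is a sketch of the exponential-weights (Hedge) variant of multiplicative weights playing against an opponent who can read its current mixed strategy. The loss scale (0 for a win, 0.5 for a tie, 1 for a defeat), the learning rate, and all names are assumptions made for this sketch, not details taken from [1].

```python
import math

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def loss(mine, theirs):
    """Loss in [0, 1]: 0 for a win, 0.5 for a tie, 1 for a defeat."""
    if mine == theirs:
        return 0.5
    return 0.0 if BEATS[mine] == theirs else 1.0

def best_responder(probs):
    """Adversary that sees our mixed strategy and maximizes our expected loss."""
    return max(MOVES, key=lambda b: sum(p * loss(m, b) for p, m in zip(probs, MOVES)))

def play_hedge(rounds, opponent):
    """Return (our total expected loss, loss of the best fixed move in hindsight)."""
    eta = math.sqrt(8.0 * math.log(len(MOVES)) / rounds)  # assumed learning rate
    probs = [1.0 / len(MOVES)] * len(MOVES)
    expected_loss = 0.0
    cumulative = [0.0] * len(MOVES)
    for _ in range(rounds):
        theirs = opponent(probs)
        losses = [loss(m, theirs) for m in MOVES]
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
        # Multiplicative update: shrink the weight of moves that did badly.
        weights = [p * math.exp(-eta * l) for p, l in zip(probs, losses)]
        total = sum(weights)
        probs = [w / total for w in weights]
    return expected_loss, min(cumulative)

T = 10000
alg, best = play_hedge(T, best_responder)
print(alg / T, (alg - best) / T)
```

Because the guarantee is stated in terms of expected loss against any sequence of opponent moves, knowing the algorithm does not help the opponent push the average loss much above 0.5, which is the value of the game on this loss scale.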

2.3. When To Use Each Method

The essential difference between Bayesian and frequentist decision theory is that Bayes makes the additional assumption of a prior over $\theta$, and optimizes for average-case performance rather than worst-case performance. It follows, then, that Bayes is the superior method whenever we can obtain a good prior and when good average-case performance is sufficient. However, if we have no way of obtaining a good prior, or when we need guaranteed performance, frequentist methods are the way to go. For instance, if we are trying to build a software package that should be widely deployable, we might want to use a frequentist method because users can be sure that the software will work as long as some number of easily-checkable assumptions are met.

A nice middle-ground between purely Bayesian and purely frequentist methods is to use a Bayesian model coupled with frequentist model-checking techniques; this gives us the freedom in modeling afforded by a prior but also gives us some degree of confidence that our model is correct. This approach is suggested by both Gelman [9] and Jordan [10].

3. Conclusion

When the assumptions of Bayes’ Theorem hold, and when Bayesian updating can be performed computationally efficiently, then it is indeed tautological that Bayes is the optimal approach. Even when some of these assumptions fail, Bayes can still be a fruitful approach. However, by working under weaker (sometimes even adversarial) assumptions, frequentist approaches can perform well in very complicated domains even with fairly simple models; this is because, with fewer assumptions being made at the outset, less work has to be done to ensure that those assumptions are met.

From a research perspective, we should be far from satisfied with either approach: Bayesian methods make stronger assumptions than may be warranted, and frequentist methods provide little in the way of a coherent framework for constructing models, and ask for worst-case guarantees, which probably cannot be obtained in general. We should seek to develop a statistical modeling framework that, unlike Bayes, can deal with unknown priors, adversaries, and limited computational resources.

4. Acknowledgements

Thanks to Emma Pierson, Vladimir Slepnev, and Wei Dai for reading preliminary versions of this work and providing many helpful comments.

5. References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta algorithm and applications. Working Paper, 2005.

[2] Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

[3] Emmanuel J. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians. European Mathematical Society, 2006.

[4] D.M. Chickering. Learning bayesian networks is NP-complete. Lecture Notes in Statistics, Springer-Verlag, New York, pages 121–130, 1996.

[5] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel Spielman, and Shang-Hua Teng. Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the 43rd ACM Symposium on Theory of Computing, 2011.

[6] Andrew Critch. Frequentist vs. bayesian breakdown: Interpretation vs. inference. http://lesswrong.com/lw/7ck/frequentist_vs_bayesian_breakdown_interpretation/.

[7] Yoav Freund and Robert E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, Sep. 1999.

[8] J. Gasthaus and Y.W. Teh. Improvements to the sequence memoizer. In Advances in Neural Information Processing Systems, 2011.

[9] Andrew Gelman. Induction and deduction in bayesian data analysis. RMM, 2:67–78, 2011.

[10] Michael I. Jordan. Are you a bayesian or a frequentist? Machine Learning Summer School 2009 (video lecture at http://videolectures.net/mlss09uk_jordan_bfway/).

[11] D. Warner North. A tutorial introduction to decision theory. IEEE Transactions on Systems Science and Cybernetics, SSC-4(3):200–210, Sep. 1968.

[12] Igal Sason. On refined versions of the Azuma-Hoeffding inequality with applications in information theory. CoRR, abs/1111.1977, 2011.

[13] Jacob Steinhardt and Zoubin Ghahramani. Pathological properties of deep bayesian hierarchies. In NIPS Workshop on Bayesian Nonparametrics, 2011. Extended Abstract.

[14] Y.W. Teh. A bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, NUS, 2006.

[15] Y.W. Teh. A hierarchical bayesian language model based on pitman-yor processes. Coling/ACL, 2006.

[16] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, Mar. 1996.

[17] F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, 2009.

Verifying Stability of Stochastic Systems

2011-07-03T00:00:00-07:00

I just finished presenting my recent paper on stochastic verification at RSS 2011. There is a conference version online, with a journal article to come later. In this post I want to go over the problem statement and my solution.

Problem Statement

Abstractly, the goal is to be given some sort of description of a system, and of a goal for that system, and then verify that the system will reach that goal. The difference between our work and a lot (but not all) of the previous work is that we want to work with an explicit noise model for the system. So, for instance, I tell you that the system satisfies

$dx(t) = f(x) dt + g(x) dw(t),$

where $f(x)$ represents the nominal dynamics of the system, $g(x)$ represents how noise enters the system, and $dw(t)$ is a standard Wiener process (the continuous-time version of Gaussian noise). I would like to, for instance, verify that $h(x(T)) < 0$ for some function $h$ and some final time $T$. For example, if $x$ is one-dimensional then I could ask that $x(10)^2-1 < 0$, which is asking for $x$ to be within a distance of $1$ of the origin at time $10$. For now, I will focus on time-invariant systems and stability conditions. This means that $f$ and $g$ are not functions of $t$, and the condition we want to verify is that $h(x(t)) < 0$ for all $t \in [0,T]$. However, it is not too difficult to extend these ideas to the time-varying case, as I will show in the results at the end.

The tool we will use for our task is a supermartingale, which allows us to prove bounds on the probability that a system leaves a certain region.

Supermartingales

Let us suppose that I have a non-negative function $V$ of my state $x$ such that $\mathbb{E}[\dot{V}(x(t))] \leq c$ for all $x$ and $t$. Here we define $\mathbb{E}[\dot{V}(x(t))]$ as

$\lim\limits_{\Delta t \to 0^+} \frac{\mathbb{E}[V(x(t+\Delta t)) \mid x(t)]-V(x(t))}{\Delta t}.$

Then, just by integrating, we can see that $\mathbb{E}[V(x(t)) \mid x(0)] \leq V(x(0))+ct$. By Markov’s inequality, the probability that $V(x(t)) \geq \rho$ is at most $\frac{V(x(0))+ct}{\rho}$.

We can actually prove something stronger as follows: note that if we re-define our Markov process to stop evolving as soon as $V(x(t)) = \rho$, then this only sets $\mathbb{E}[\dot{V}]$ to zero in certain places. Thus the probability that $V(x(t)) \geq \rho$ for this new process is at most $\frac{V(x(0))+\max(c,0)t}{\rho}$. Since the process stops as soon as $V(x) \geq \rho$, we obtain the stronger result that the probability that $V(x(s)) \geq \rho$ for any $s \in [0,t]$ is at most $\frac{V(x(0))+\max(c,0)t}{\rho}$. Finally, we only need the condition $\mathbb{E}[\dot{V}] \leq c$ to hold when $V(x) < \rho$. We thus obtain the following:

Theorem. Let $V(x)$ be a non-negative function such that $\mathbb{E}[\dot{V}(x(t))] \leq c$ whenever $V(x) < \rho$. Then with probability at least $1-\frac{V(x(0))+\max(c,0)T}{\rho}$, $V(x(t)) < \rho$ for all $t \in [0,T)$.

We call the condition $\mathbb{E}[\dot{V}] \leq c$ the supermartingale condition, and a function that satisfies the supermartingale condition is called a supermartingale. If we can construct supermartingales for our system, then we can bound the probability that trajectories of the system leave a given region.

NOTE: for most people, a supermartingale is something that satisfies the condition $\mathbb{E}[\dot{V}] \leq 0$. However, this condition is often impossible to satisfy for systems we might care about. For instance, just consider exponential decay driven by Gaussian noise:

Once the system gets close enough to the origin, the exponential decay ceases to matter much and the system is basically just getting bounced around by the Gaussian noise. In particular, if the system is ever at the origin, it will get perturbed away again, so you cannot hope to find a non-constant function of $x$ that is decreasing in expectation everywhere. (Just consider the global minimum of such a function: in all cases, there is a non-zero probability that the Gaussian noise will cause $V(x)$ to increase, but a zero probability that $V(x)$ will decrease because we are already at the global minimum. This argument doesn’t actually work as stated, but I am pretty sure that my claim is true, at least subject to sufficient technical conditions.)
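A quick simulation makes this behavior easy to see. The sketch below assumes the concrete system $dx(t) = -x\,dt + dw(t)$ (the same one used as the running example later in this post) and a plain Euler-Maruyama discretization; the step size, horizon, seed, and function name are arbitrary choices for illustration.

```python
import math
import random

def simulate_ou(x0, total_time, dt, rng):
    """Euler-Maruyama discretization of dx = -x dt + dw."""
    steps = int(round(total_time / dt))
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment with variance dt
        x += -x * dt + dw
        path.append(x)
    return path

rng = random.Random(0)
path = simulate_ou(x0=3.0, total_time=500.0, dt=0.01, rng=rng)
tail = path[len(path) // 10:]  # drop the initial decay toward the origin
mean_square = sum(x * x for x in tail) / len(tail)
print(mean_square)  # stays bounded away from zero instead of decaying
```

The trajectory decays quickly from its starting point and then keeps getting kicked around near the origin, so $x^2$ settles around a positive level (about $0.5$ for this system) rather than decreasing forever.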

Applying the Martingale Theorem

Now that we have this theorem, we need some way to actually use it. First, let us try to get a more explicit version of the Martingale condition for the systems we are considering, which you will recall are of the form $dx(t) = f(x) dt + g(x) dw(t)$. Note that $V(x+\Delta x) = V(x) + \frac{\partial V}{\partial x} \Delta x + \frac{1}{2}Trace\left(\Delta x^T \frac{\partial^2 V}{\partial x^2}\Delta x\right)+O(\Delta x^3)$.

Then $\mathbb{E}[\dot{V}(x)] = \lim_{\Delta t \to 0^+} \frac{\frac{\partial V}{\partial x} \mathbb{E}[\Delta x]+\frac{1}{2}Trace\left(\mathbb{E}\left[\Delta x^T \frac{\partial^2 V}{\partial x^2}\Delta x\right]\right)+O(\Delta x^3)}{\Delta t}$. A Wiener process satisfies $\mathbb{E}[dw(t)] = 0$ and $\mathbb{E}[dw(t)^2] = dt$, so only the nominal dynamics ($f(x)$) affect the limit of the first-order term while only the noise ($g(x)$) affects the limit of the second-order term (the third-order and higher terms in $\Delta x$ all go to zero). We thus end up with the formula

$\mathbb{E}[\dot{V}(x)] = \frac{\partial V}{\partial x}f(x)+\frac{1}{2}Trace\left(g(x)^T\frac{\partial^2 V}{\partial x^2}g(x)\right).$

It is not that difficult to construct a supermartingale, but most supermartingales that you construct will yield a pretty poor bound. To illustrate this, consider the system $dx(t) = -x dt + dw(t)$. This is the example in the image from the previous section. Now consider a quadratic function $V(x) = x^2$. The preceding formula tells us that $\mathbb{E}[\dot{V}(x)] = -2x^2+1$. We thus have $\mathbb{E}[\dot{V}(x)] \leq 1$, which means that the probability of leaving the region $x^2 < \rho$ is at most $\frac{x(0)^2+T}{\rho}$. This is not particularly impressive: it says that we should expect $x$ to grow roughly as $\sqrt{T}$, which is how quickly $x$ would grow if it was a random walk with no stabilizing component at all.
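To get a feel for how loose this bound is, one can compare it with a Monte Carlo estimate of the actual escape probability. This is only a sketch: it uses an Euler-Maruyama discretization (so escapes between time steps are missed), and the starting point, horizon, threshold, seed, and helper names are all arbitrary assumptions.

```python
import math
import random

def escapes(x0, horizon, rho, dt, rng):
    """True if x(t)^2 reaches rho at some step in [0, horizon] for dx = -x dt + dw."""
    x = x0
    for _ in range(int(round(horizon / dt))):
        if x * x >= rho:
            return True
        x += -x * dt + rng.gauss(0.0, math.sqrt(dt))
    return x * x >= rho

def markov_bound(x0, horizon, rho, c=1.0):
    """Bound from the theorem with V(x) = x^2, where E[V-dot] <= c = 1."""
    return min(1.0, (x0 * x0 + max(c, 0.0) * horizon) / rho)

rng = random.Random(1)
x0, horizon, rho, trials = 0.0, 1.0, 4.0, 2000
empirical = sum(escapes(x0, horizon, rho, 0.01, rng) for _ in range(trials)) / trials
bound = markov_bound(x0, horizon, rho)
print(empirical, bound)
```

The empirical escape frequency should come out well below the bound, which is consistent with the quadratic supermartingale being valid but far from tight.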

One way to deal with this is to have a state-dependent bound $\mathbb{E}[\dot{V}] \leq c-kV$. This has been considered for instance by Pham, Tabareau, and Slotine (see Lemma 2 and Theorem 2), but I am not sure whether their results still work if the supermartingale condition only holds locally instead of globally; I haven’t spent much time on this, so they could generalize quite trivially.

Another way to deal with this is to pick a more quickly-growing candidate supermartingale. For instance, we could pick $V(x) = x^4$. Then $\mathbb{E}[\dot{V}] = -4x^4+6x^2$, which has a global maximum of $\frac{9}{4}$ at $x = \frac{\sqrt{3}}{2}$. This bound then says that $x$ grows at a rate of at most $T^{\frac{1}{4}}$, which is better than before, but still much worse than reality.

We could keep improving on this bound by considering successively faster-growing polynomials. However, automating such a process becomes expensive once the degree of the polynomial gets large. Instead, let’s consider a function like $V(x) = e^{0.5x^2}$. Then $\mathbb{E}[\dot{V}] = e^{0.5x^2}(0.5-0.5x^2)$, which has a maximum of $0.5$ at $x = 0$. Now our bound says that we should expect $x$ to grow like $\sqrt{\log(T)}$, which is a much better growth rate (and roughly the true growth rate, at least in terms of the largest value of $|x|$ over the time interval $[0,T]$).
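The three closed forms for $\mathbb{E}[\dot{V}]$ used in this section can be checked numerically against the scalar version of the general formula, $V'(x)f(x) + \frac{1}{2}g(x)^2V''(x)$. The sketch below uses central finite differences with $f(x) = -x$ and $g(x) = 1$; the helper name `expected_vdot`, the step size, and the grid are assumptions for illustration.

```python
import math

def expected_vdot(V, f, g, x, h=1e-3):
    """Scalar formula V'(x) f(x) + 0.5 g(x)^2 V''(x), via central differences."""
    d1 = (V(x + h) - V(x - h)) / (2.0 * h)
    d2 = (V(x + h) - 2.0 * V(x) + V(x - h)) / (h * h)
    return d1 * f(x) + 0.5 * g(x) ** 2 * d2

f = lambda x: -x   # nominal dynamics
g = lambda x: 1.0  # noise gain

# Each candidate V is paired with the closed form for E[V-dot] quoted in the text.
candidates = {
    "x^2": (lambda x: x ** 2, lambda x: -2 * x ** 2 + 1),
    "x^4": (lambda x: x ** 4, lambda x: -4 * x ** 4 + 6 * x ** 2),
    "exp(0.5 x^2)": (lambda x: math.exp(0.5 * x * x),
                     lambda x: math.exp(0.5 * x * x) * (0.5 - 0.5 * x * x)),
}

grid = [i / 100.0 for i in range(-200, 201)]
max_error = {}
max_value = {}
for name, (V, closed_form) in candidates.items():
    max_error[name] = max(abs(expected_vdot(V, f, g, x) - closed_form(x)) for x in grid)
    max_value[name] = max(closed_form(x) for x in grid)
print(max_error)
print(max_value)
```

The largest values on the grid play the role of $c$ in the theorem; the grid maximum for $x^4$ is only approximately $\frac{9}{4}$ because $\frac{\sqrt{3}}{2}$ is not a grid point.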

This leads us to our overall strategy for finding good supermartingales. We will search across functions of the form $V(x) = e^{x^TSx}$ where $S \succeq 0$ is a matrix (the $\succeq$ means “positive semidefinite”, which roughly means that the graph of the function $x^TSx$ looks like a bowl rather than a saddle/hyperbola). This raises two questions: how to upper-bound the global maximum of $\mathbb{E}[\dot{V}]$ for this family, and how to search efficiently over this family. The former is done by doing some careful work with inequalities, while the latter is done with semidefinite programming. I will explain both below.

Upper-bounding $\mathbb{E}[\dot{V}]$

In general, if $V(x) = e^{x^TSx}$, then $\mathbb{E}[\dot{V}(x)] = e^{x^TSx}\left(2x^TSf(x)+Trace(g(x)^TSg(x))+2x^TSg(x)g(x)^TSx\right)$. We would like to show that such a function is upper-bounded by a constant $c$. To do this, move the exponential term to the right-hand-side to get the equivalent condition $2x^TSf(x)+Trace(g(x)^TSg(x))+2x^TSg(x)g(x)^TSx \leq ce^{-x^TSx}$. Then we can lower-bound $e^{-x^TSx}$ by $1-x^TSx$ and obtain the sufficient condition

$c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))-2x^TSg(x)g(x)^TSx \geq 0.$

It is still not immediately clear how to check such a condition, but somehow the fact that this new condition only involves polynomials (assuming that $f$ and $g$ are polynomials) seems like it should make computations more tractable. This is indeed the case. While checking if a polynomial is non-negative is NP-hard, checking whether it is a sum of squares of other polynomials can be done in polynomial time. While sum of squares is not the same as non-negative, it is a sufficient condition (since the square of a real number is always non-negative).

The way we check whether a polynomial p(x) is a sum of squares is to formulate it as the semidefinite program: $p(x) = z^TMz, M \succeq 0$, where $z$ is a vector of monomials. The condition $p(x) = z^TMz$ is a set of affine constraints on the entries of $M$, so that the above program is indeed semidefinite and can be solved efficiently.
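The sketch below shows what such a certificate looks like for a single-variable example. It does not solve the semidefinite program (that would need an SDP solver); it only verifies a hand-picked Gram matrix $M$ for $p(x) = x^4 - 2x^3 + 3x^2 - 2x + 1 = (x^2 - x + 1)^2$ with $z = (1, x, x^2)$. The example polynomial and all function names are assumptions made for illustration.

```python
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def is_psd(m, tol=1e-12):
    """A symmetric matrix is PSD iff every principal minor is non-negative."""
    n = len(m)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if det([[m[i][j] for j in idx] for i in idx]) < -tol:
                return False
    return True

def gram_to_coeffs(m):
    """Coefficients of z^T M z for z = (1, x, ..., x^d), lowest degree first."""
    n = len(m)
    coeffs = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            coeffs[i + j] += m[i][j]  # each coefficient is affine in the entries of M
    return coeffs

# p(x) = 1 - 2x + 3x^2 - 2x^3 + x^4, coefficients listed lowest degree first
p = [1, -2, 3, -2, 1]
M = [[1, -1, 1],
     [-1, 1, -1],
     [1, -1, 1]]
print(gram_to_coeffs(M) == p, is_psd(M))
```

Matching coefficients is the set of affine constraints on $M$, and the principal-minor test stands in for the $M \succeq 0$ constraint; a solver would search over all $M$ satisfying both rather than checking one guess.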

Efficiently searching across all matrices S

We can extend the sum-of-squares idea in the previous section to search over $S$. Note that if $p$ is a parameterized polynomial whose coefficients are affine in a set of decision variables, then the condition $p(x) = z^TMz$ is again a set of affine constraints on $M$. This almost solves our problem for us, but not quite. The issue is the form of $p(x)$ in our case:

$c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))-2x^TSg(x)g(x)^TSx$

Do you see the problem? There are two places where the constraints do not appear linearly in the decision variables: $c$ and $S$ multiply each other in the first term, and $S$ appears quadratically in the last term. While the first non-linearity is not so bad ($c$ is a scalar so it is relatively cheap to search over $c$ exhaustively), the second non-linearity is more serious. Fortunately, we can resolve the issue with Schur complements. The idea behind Schur complements is that, assuming $A \succ 0$ (so that $A^{-1}$ exists), the condition $B^TA^{-1}B \preceq C$ is equivalent to $\left[ \begin{array}{cc} A & B \\ B^T & C \end{array} \right] \succeq 0$. In our case, this means that our condition is equivalent to the condition that

$\left[ \begin{array}{cc} 0.5I & g^TSx \\ x^TSg & c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))\end{array} \right] \succeq 0$

where $I$ is the identity matrix. Now we have a condition that is linear in the decision variable $S$, but it is no longer a polynomial condition; it is a condition that a matrix polynomial be positive semidefinite. Fortunately, we can reduce this to a purely polynomial condition by creating a set of dummy variables $y$ and asking that

$y^T\left[ \begin{array}{cc} 0.5I & g^TSx \\ x^TSg & c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))\end{array} \right]y \geq 0$
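To see why the block matrix encodes the original polynomial condition, it may help to spell out the Schur complement step. This is only a sketch in the notation already used above: $A$, $B$, $C$ follow the statement of the Schur complement condition, and $u$, $v$ are illustrative labels for the two blocks of the dummy vector $y$. Take

$A = 0.5I, \quad B = g(x)^TSx, \quad C = c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x)).$

Then

$B^TA^{-1}B = x^TSg(x)\,(2I)\,g(x)^TSx = 2x^TSg(x)g(x)^TSx,$

so $B^TA^{-1}B \preceq C$ (here just a scalar inequality) is exactly

$c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))-2x^TSg(x)g(x)^TSx \geq 0.$

Writing $y = (u, v)$ with $v$ a scalar, the quadratic form in the dummy variables expands to

$0.5\,u^Tu + 2v\,u^Tg(x)^TSx + v^2\,C \geq 0,$

which is an ordinary polynomial in $(x, u, v)$ and, for fixed $c$, is linear in $S$, so the sum-of-squares machinery applies to it.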

We can then do a line search over $c$ and solve a semidefinite program to determine a feasible value of $S$. If we care about remaining within a specific region, we can maximize $\rho$ such that $x^TSx < \rho$ implies that we stay in the region. Since our bound on the probability of leaving the region grows roughly as $ce^{-\rho}T$, this is a pretty reasonable thing to maximize (we would actually want to maximize $\rho-\log(c)$, but this is a bit more difficult to do).

Oftentimes, for instance if we are verifying stability around a trajectory, we would like $c$ to be time-varying. In this case an exhaustive search is no longer feasible. Instead we alternate between searching over $S$ and searching over $c$. In the step where we search over $S$, we maximize $\rho$. In the step where we search over $c$, we maximize the amount by which we could change $c$ and still satisfy the constraints (the easiest way to do this is by first maximizing $c$, then minimizing $c$, then taking the average; the fact that semidefinite constraints are convex implies that this optimizes the margin on $c$ for a fixed $S$).

A final note is that systems are often only stable locally, and so we only want to check the constraint $\mathbb{E}[\dot{V}] \leq c$ in a region where $V(x) < \rho$. We can do this by adding a Lagrange multiplier to our constraints. For instance, if we want to check that $p(x) \geq 0$ whenever $s(x) \leq 0$, it suffices to find a polynomial $\lambda(x)$ such that $\lambda(x) \geq 0$ and $p(x)+\lambda(x)s(x) \geq 0$. (You should convince yourself that this is true; the easiest proof is just by casework on the sign of $s(x)$.) This again introduces a non-linearity in the constraints, but if we fix $S$ and $\rho$ then the constraints are linear in $c$ and $\lambda$, and vice-versa, so we can perform the same alternating maximization as before.

Results

Below is the most exciting result, which is for an airplane with a noisy camera trying to avoid obstacles. Using the verification methods above, we can show that with probability at least $0.99$ the plane trajectory will not leave the gray region.

Useful Math

2011-01-23T00:00:00-08:00

I have spent the last several months doing applied math, culminating in a submission of a paper to a robotics conference (although culminating might be the wrong word, since I’m still working on the project).

Unfortunately the review process is double-blind so I can’t talk about that specifically, but I’m more interested in going over the math I ended up using (not expositing on it, just making a list, more or less). This is meant to be a moderate amount of empirical evidence for which pieces of math are actually useful, and which aren’t (of course, the lack of appearance on this list doesn’t imply uselessness, but should be taken as Bayesian evidence against usefulness).

I’ll start with the stuff that I actually used in the paper, then stuff that helped me formulate the ideas in the paper, then stuff that I’ve used in other work that hasn’t yet come to fruition. These will be labelled I, II, and III below. Let me know if you think something should be in III that isn’t [in other words, you think there’s a piece of math that is useful but not listed here, preferably with the application you have in mind], or if you have better links to any of the topics below.

I. Ideas used directly

Optimization: semidefinite optimization, convex optimization, sum-of-squares programming, Schur complements, Lagrange multipliers, KKT conditions

Differential equations: Lyapunov functions, linear differential equations, Poincaré return map, exponential stability, Ito calculus

Linear algebra: matrix exponential, trace, determinant, Cholesky decomposition, plus general matrix manipulation and familiarity with eigenvalues and quadratic forms

Probability theory: Markov’s inequality, linearity of expectation, martingales, multivariate normal distribution, stochastic processes (Wiener process, Poisson process, Markov process, Lévy process, stopped process)

Multivariable calculus: partial derivative, full derivative, gradient, Hessian, matrix calculus, Taylor expansion

II. Indirectly helpful ideas

Inequalities: Jensen’s inequality, testing critical points

Optimization: (non-convex) function minimization

III. Other useful ideas

Calculus: calculus of variations, extended binomial theorem

Function Approximation: variational approximation, neural networks

Graph Theory: random walks and relation to Markov Chains, Perron-Frobenius Theorem, combinatorial linear algebra, graphical models (Bayesian networks, Markov random fields, factor graphs)

Miscellaneous: Kullback-Leibler divergence, Riccati equation, homogeneity / dimensional analysis, AM-GM, induced maps (in a general algebraic sense, not just the homotopy sense; unfortunately I have no good link for this one)

Probability: Bayes’ rule, Dirichlet process, Beta and Bernoulli processes, detailed balance and Markov Chain Monte Carlo

Spectral analysis: Fourier transform, windowing, aliasing, wavelets, Pontryagin duality

Linear algebra: change of basis, Schur complement, adjoints, kernels, injectivity/surjectivity/bijectivity of linear operators, natural transformations / characteristic subspaces

Topology: compactness, open/closed sets, dense sets, continuity, uniform continuity, connectedness, path-connectedness

Analysis: Lipschitz continuity, Lebesgue measure, Haar measure, manifolds, algebraic manifolds

Optimization: quasiconvexity

Generalizing Across Categories

2010-10-02T00:00:00-07:00

Humans are very good at correctly generalizing rules across categories (at least, compared to computers). In this post I will examine mechanisms that would allow us to do this in a reasonably rigorous manner. To this end I will present a probabilistic model such that conditional inference on that model leads to generalization across a category.

There are three questions along these lines that I hope to answer:

How does one generalize rules across categories?
How does one determine which rules should generalize across which categories?
How does one determine when to group objects into a category in the first place?

I suspect that the mechanisms for each of these are rather complex, but I am reasonably confident that the methods I present make up at least part of the actual answer. A good exercise is to come up with examples where these methods fail.

Generalizing Across Categories

For simplicity I’m going to consider just some sort of binary rule, such as the existence of an attribute. So as an example, let’s suppose that we see a bunch of ducks, which look to varying degrees to be mallards. In addition to this, we notice that some of the ducks have three toes, and some have four toes. In this case the category is “mallards”, and the attribute is “has three toes”.

A category is going to be represented as a probabilistic relation for each potential member, of the form $p(x \in C)$, where $C$ is the category and $x$ is the object in question. This essentially indicates the degree to which the object $x$ belongs to the category $C$. For instance, if the category is “birds”, then pigeons, sparrows, and eagles all fit into the category very well, so we assign a high probability (close to 1) to pigeons, sparrows, and eagles belonging in the category “birds”. On the other hand, flamingos, penguins, ostriches, and pterodactyls, while still being bird-like, don’t necessarily fit into our archetypal notion of what it means to be a bird. So they get lower probabilities (I’d say around 0.8 for the first 3 and around 0.1 for pterodactyls, but that is all pretty subjective). Finally, dogs and cats get a near-zero probability of being birds, and a table, the plus operator, or Beethoven’s Ninth Symphony would get even-closer-to-zero probabilities of being a bird.

In the example with ducks, the probability of being a mallard will probably be based on how many observed characteristics the duck in question has in common with mallards.

For now we will think of a rule or an attribute as an observable binary property about various objects. Let’s let $P$ be the set of all things that have this property. Now we assume that membership in $P$ has some base probability of occurrence $\theta$, together with some probability of occurrence within $C$ of $\theta_C$. We will assume that there are objects $x_i$, $y_j$, and $z_k$, such that the $x_i$ were all observed to lie in $P$, the $y_j$ were all observed to not lie in $P$, and membership in $P$ has been unobserved for the $z_k$.

Furthermore, $x_i$ has probability $p_i$ of belonging to $C$, $y_j$ has probability $q_j$ of belonging to $C$, and $z_k$ has probability $r_k$ of belonging to $C$.

Again, going back to the ducks example, the $x_i$ are the ducks observed to have three toes, the $y_j$ are the ducks observed to have four toes, and the $z_k$ are the ducks whose number of toes we have not yet observed. The $p_i$, $q_j$, and $r_k$ are the aforementioned probabilities that each of these ducks is a mallard.

So to summarize, we have three classes of objects — those which certainly lie in P, those which certainly don’t lie in P, and those for which we have no information about P. For each element in each of these classes, we have a measure of the extent to which it belongs in C.

Given this setup, how do we actually go about specifying a model explicitly? Basically, we say that if something lies in $C$ then it has property $P$ with probability $\theta_C$, and otherwise it has it with probability $\theta$. Thus we end up with

$p(\theta, \theta_C \ \mid \ p_i, q_j) \propto p(\theta)p(\theta_C) \left(\prod_{i} \left[p_i\theta_C+(1-p_i)\theta\right]\right)\left(\prod_{j} \left[q_j(1-\theta_C)+(1-q_j)(1-\theta)\right]\right)$

What this ends up meaning is that $\theta$ will shift to accommodate the objects that (most likely) don’t lie in $C$, and $\theta_C$ will shift to accommodate the objects that (most likely) do lie in $C$. At the same time, if for instance most elements in $C$ also lie in $P$, then the elements that don’t lie in $P$ will have a relatively lower posterior probability of lying in $C$ (compared to their prior probability). So this model not only has the advantage of generalizing across all of $C$ based on the given observations, it also has the advantage of re-evaluating whether an object is likely to lie in a category based on whether it shares attributes with the other objects in that category.

Let’s go back to the mallards example one more time. Suppose that after we make all of our observations, we notice that most of the ducks that we think are mallards also have three toes. Then $\theta_C$ in this case (the probability that a mallard has three toes) will be close to 1. Furthermore, any duck that we observe to have four toes, we will think much less likely to be a mallard, even if it otherwise looks similar (although it wouldn’t be impossible; it could be a genetic mutant, for instance). At the same time, if non-mallards seem to have either three or four toes with roughly equal frequency, then $\theta$ will be close to 0.5.
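To make the mallard example concrete, here is a minimal numerical sketch of the posterior above, evaluated on a grid. The uniform priors on $\theta$ and $\theta_C$, the grid resolution, the function names, and the duck data are all assumptions made purely for illustration.

```python
import numpy as np

def category_posterior(p, q, grid_size=201):
    """Grid approximation to the posterior over (theta, theta_C).

    p: membership probabilities p_i of objects observed to have the property
    q: membership probabilities q_j of objects observed to lack the property
    Assumes independent uniform priors on theta and theta_C (an illustrative choice).
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    # With indexing="ij", theta varies along axis 0 and theta_C along axis 1.
    theta, theta_c = np.meshgrid(grid, grid, indexing="ij")
    log_post = np.zeros_like(theta)
    with np.errstate(divide="ignore"):  # log(0) at grid corners is fine (-inf)
        for p_i in p:
            log_post += np.log(p_i * theta_c + (1 - p_i) * theta)
        for q_j in q:
            log_post += np.log(q_j * (1 - theta_c) + (1 - q_j) * (1 - theta))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return grid, post

def posterior_means(grid, post):
    """Posterior means of theta and theta_C from the gridded posterior."""
    mean_theta = float((post.sum(axis=1) * grid).sum())
    mean_theta_c = float((post.sum(axis=0) * grid).sum())
    return mean_theta, mean_theta_c

# Made-up data: 8 probable mallards and 5 probable non-mallards with three toes,
# 1 probable mallard and 5 probable non-mallards with four toes.
p = [0.9] * 8 + [0.1] * 5
q = [0.9] * 1 + [0.1] * 5
grid, post = category_posterior(p, q)
mean_theta, mean_theta_c = posterior_means(grid, post)
print(f"E[theta] = {mean_theta:.2f}, E[theta_C] = {mean_theta_c:.2f}")
```

With data like this, the posterior mean of $\theta_C$ should sit well above that of $\theta$, which stays near one half, matching the qualitative story in the example.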

There is one thing that I am dissatisfied with in the above model, though. As it stands, $\theta$ measures the probability of an object not lying in $C$ to have property $P$, rather than the probability of a generic object to have property $P$. This is mainly a problem because, later on, I would like to be able to talk about an object lying in multiple categories, and I don’t have a good way of doing that yet.

An important thing to realize here is that satisfying a rule or having an attribute is just another way of indicating membership in a set. So we can think of both the category and the attribute as potential sets that an object could lie in; as before, we’ll call the category $C$ and we will call the set of objects having a given attribute $A$. Then $\theta_C$ is something like $p(x \in A \ \mid \ x \in C)$, whereas $\theta$ (should be) $p(x \in A)$, although as I’ve set it up now it’s more like $p(x \in A \ \mid \ x \not\in C)$.

As a final note, this same setup can be applied to the case when there are multiple attributes under consideration. We can also apply it to the case when the objects are themselves categories, and so instead of having strict observations about each attribute, we have some sort of probability that the object will possess an attribute. In this latter case we just treat these probabilities as uncertain observations, as discussed previously.

When To Generalize

Another important question is which rules / attributes we should expect to generalize across a category, and which we should not. For instance, we would expect “number of toes” to generalize across all animals in a given species, but not “age”.

I think that for binary attributes we will always have to generalize, although our generalization could just be “occurs at the base rate in the population”. However, for more complicated attributes (for instance, age, which has an entire continuum of possible values), we can think about whether to generalize based on whether our posterior distribution ends up very tightly concentrated or very spread out. Even if it ends up fairly spread out, there is still some degree of generalization — for instance, we expect that most insects have fairly short lifespans compared to mammals. You could say that this is actually a scientific fact (since we can measure lifespan), but even if you ran into a completely new species of insect, you would expect it to have a relatively short lifespan.

On the other hand, if we went to a college town, we might quickly infer that everyone there was in their 20s; if we went to a bridge club, we would quickly infer that everyone there was past retirement age.

What I would say is that we can always generalize across a category, but sometimes the posterior distribution we end up with is very spread out, and so our generalization isn’t particularly strong. In some cases, the posterior distribution might not even be usefully different from the prior distribution; in this case, from a computational standpoint it doesn’t make sense to keep track of it, and so it feels like an attribute shouldn’t generalize across a category, while what we really mean is that we don’t bother to keep track of that generalization.

An important thing to keep in mind on this topic is that everything that we do when we construct models is a computational heuristic. If we didn’t care about computational complexity and only wanted to arrive at an answer, we would use Solomonoff induction, or some computable approximation to it. So whenever I talk about something in probabilistic modeling, I’m partially thinking about whether the method is computationally feasible at all, and what sort of heuristics might be necessary to actually implement something in practice.

To summarize:

attributes always generalize across a category
but the generalization might be so weak as to be indistinguishable from the prior distribution
it would be interesting to figure out how we decide which generalizations to keep track of, and which not to

Forming Categories

My final topic of consideration is when to actually form a category. Before going into this, I’d like to share some intuition with you. This is the intuition of explanatory “cheapness” or “expensiveness”. The idea is that events that have low probability under a given hypothesis are “expensive”, and events that have relatively high probability are “cheap”. When a fixed event is expensive under a model, that is evidence against that model; when it is cheap under a model, that is evidence for the model.

The reason that forming clusters can be cheaper is that it allows us to explain many related events simultaneously, which is cheaper than explaining them all separately. Let’s take an extreme example — suppose that 70% of all ducks have long beaks, and 70% have three toes (the remaining 30% in both cases have short beaks and four toes). Then the “cheapest” explanation would be to assign both long beaks and three toes a probability of 0.7, and the probability of our observations would be (supposing there were 100 ducks) $0.7^{140}0.3^{60} \approx 10^{-53}$. However, suppose that we also observe that the ducks with long beaks are exactly the ducks with three toes. Then we can instead postulate that there are two categories, “(long,3)” and “(short,4)”, and that the probability of ending up in the first category is 0.7 while the probability of ending up in the second category is 0.3. This is a much cheaper explanation, as the probability of our observations in this scenario becomes $0.7^{70}0.3^{30} \approx 10^{-27}$. We have to balance this slightly with the fact that an explanation that involves creating an extra category is more expensive (because it has a lower prior probability / higher complexity) than one that doesn’t, but the extra category can’t possibly be $10^{26}$ times more expensive, so we should still always favor it.
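The two "costs" in this example are easy to check in log space. The snippet below is only a sanity check of the arithmetic, using the 100 ducks and the 70/30 split from the example; the variable names are my own.

```python
import math

n_long = 70   # ducks with long beaks (and, in the second scenario, three toes)
n_short = 30  # ducks with short beaks (and four toes)

# Base cost of one 70/30 attribute split across the 100 ducks.
log10_single = n_long * math.log10(0.7) + n_short * math.log10(0.3)

# No categories: beak length and toe count are explained independently,
# so every duck contributes two terms.
log10_independent = 2 * log10_single

# Two categories, (long,3) and (short,4): every duck contributes one term.
log10_categories = log10_single

print(f"independent attributes: 10^{log10_independent:.1f}")
print(f"two categories:         10^{log10_categories:.1f}")
print(f"advantage of categories: 10^{log10_categories - log10_independent:.1f}")
```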

This also demonstrates that it will become progressively cheaper to form categories as we get large amounts of similar evidence, which corresponds with our intuitive notion of categories as similarity clusters.

So, now that I’ve actually given you this intuition, how do we actually go about forming categories? Each time we form a category, we will have some base probability of membership in that category, together with a probability of each member of that category possessing each attribute under consideration. If we simply pick prior distributions for each of these parameters, then they will naturally adopt posterior distributions based on the observed data, although these distributions might have multiple high-probability regions if multiple categories can be formed from the data. At the same time, each object will naturally end up with some probability of being in each category, based on how well that category explains its characteristics (as well as the base probability that objects should end up in that category). Putting these together, we see that clusters will naturally form to accommodate the data.

As an example of how we would have multiple high-probability regions, suppose that, as before, there are long-billed, three-toed ducks and short-billed, four-toed ducks. But we also notice that ducks with orange beaks have white feathers and ducks with yellow beaks have brown feathers. If we form a single category, then there are two good candidates, so that category is likely to be either the (long,3) category or the (orange,white) category (or their complements). If we form two categories, each category in isolation is still likely to be either the (long,3) or (orange,white) category, although if the first category is (long,3), then the second is very likely to be (orange,white), and vice versa. In other words, it would be silly to make both categories (long,3), or both categories (orange,white).

The only thing that is left to do is to specify some prior probability of forming a category. While there are various ways to do this, the most commonly used way is the Indian Buffet Process. I won’t explain it in detail here, but I might explain it later.

Future Considerations

There are still some unresolved questions here. First of all, in reality something has the potential to be in many, many categories, and it is not entirely clear how to resolve such issues in the above framework. Secondly, keeping track of all of our observations about a given category can be quite difficult computationally (updating in the above scenarios requires performing computations on high-degree polynomials), so we need efficient algorithms for dealing with all of the data we get.

I’m not sure yet how to deal with the first issue, although I’ll be thinking about it in the coming weeks. To deal with the second issue, I intend to use an approach based on streaming algorithms, which will hopefully make up a good final project for a class I’m taking this semester (Indyk and Rubinfeld’s class on Sublinear Algorithms, if you know about MIT course offerings).

Uncertain Observations

2010-09-18T00:00:00-07:00

What happens when you are uncertain about observations you made? For instance, you remember something happening, but you don’t remember who did it. Or you remember some fact you read on Wikipedia, but you don’t know whether it said that hydrogen or helium was used in some chemical process.

How do we take this information into account in the context of Bayes’ rule? First, I’d like to note that there are different ways something could be uncertain. It could be that you observed X, but you don’t remember if it was in state A or state B. Or it could be that you think you observed X in state A, but you aren’t sure.

These are different because in the first case you don’t know whether to concentrate probability mass towards A or B, whereas in the second case you don’t know whether to concentrate probability mass at all.

Fortunately, both cases are pretty straightforward as long as you are careful about using Bayes’ rule. However, today I am going to focus on the latter case. In fact, I will restrict my attention to the following problem:

You have a coin that has some probability $\pi$ of coming up heads. You also know that all flips of this coin are independent. But you don’t know what $\pi$ is. However, you have observed this coin $n$ times in the past. But for each observation, you aren’t completely sure that this was the coin you were observing. In particular, you only assign a probability $r_i$ to your $i$th observation actually being about this coin. Given this, and the sequence of heads and tails you remember, what is your estimate of $\pi?$

To use Bayes’ rule, let’s first figure out what we need to condition on. In this case, we need to condition on remembering the sequence of coin flips that we remembered. So we are looking for

p($\pi = \theta$ | we remember the given sequence of flips),

which is proportional to

p(we remember the given sequence of flips | $\pi = \theta$) $\cdot$ p($\pi = \theta$).

The only thing that the uncertain nature of our observations does is cause there to be multiple ways to eventually land in the set of universes where we remember the sequence of flips; in particular, for any observation we remember, it could have actually happened, or we could have incorrectly remembered it. Thus if $\pi = \theta$, and we remember the $i$th coin flip as being heads, then this could happen with probability $1-r_i$ if we incorrectly remembered a coin flip of heads. In the remaining probability $r_i$ case, it could happen with probability $\theta$ by actually coming up heads. Therefore the probability of us remembering that the $i$th flip was heads is $(1-r_i)+r_i \theta$.

A similar computation shows that the probability of us remembering that the $i$th flip was tails is $(1-r_i)+r_i(1-\theta) = 1-r_i\theta$.

For convenience of notation, let’s actually split up our remembered flips into those that were heads and those that were tails. The probability of the $i$th remembered heads being real is $h_i$, and the probability of the $j$th remembered tails being real is $t_j$. There are $m$ remembered heads and $n$ remembered tails. Then we get

$p(\pi = \theta \mid \mathrm{our \ memory}) \propto p(\pi = \theta) \cdot \left(\prod_{i=1}^m \left[(1-h_i)+h_i\theta\right] \right) \cdot \left(\prod_{j=1}^n \left[1-t_j\theta\right]\right)$.

Note that when we consider values of $\theta$ close to $0$, the term from the remembered tails becomes close to $1-\theta$ raised to the power of the expected number of tails, whereas the term from the remembered heads becomes close to the probability that we incorrectly remembered each of the heads. A similar phenomenon will happen when $\theta$ gets close to $1$. This is an instance of a more general phenomenon whereby unlikely observations get “explained away” by whatever means possible.
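As a rough illustration, the posterior above can be evaluated on a grid of candidate values of $\theta$. This is only a sketch: the uniform prior on $\pi$, the grid size, the function name, and the reliabilities in the example call are assumptions chosen for illustration.

```python
import numpy as np

def coin_posterior(h, t, grid_size=1001):
    """Grid posterior over theta given uncertain recollections.

    h: probabilities h_i that each remembered heads was really about this coin
    t: probabilities t_j that each remembered tails was really about this coin
    Assumes a uniform prior on theta (an illustrative choice).
    """
    theta = np.linspace(0.0, 1.0, grid_size)
    log_post = np.zeros_like(theta)
    with np.errstate(divide="ignore"):  # log(0) at the endpoints is fine (-inf)
        for h_i in h:
            log_post += np.log((1 - h_i) + h_i * theta)
        for t_j in t:
            log_post += np.log(1 - t_j * theta)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return theta, post

# Seven remembered heads and three remembered tails, each trusted at 90%.
theta, post = coin_posterior(h=[0.9] * 7, t=[0.9] * 3)
posterior_mean = float((theta * post).sum())
print(f"posterior mean of pi: {posterior_mean:.3f}")
```

Setting every reliability to 1 recovers the ordinary Beta posterior for a coin with fully trusted observations, which is a convenient check on the implementation.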

A Caveat

Applying the above model in practice can be quite tricky. The reason is that your memories are intimately tied to all sorts of events that happen to you; in particular, your assessment of how likely you are to remember an event probably already takes into account how well that event fits into your existing model. So if you saw 100 heads, and then a tails, you would place more weight than normal on your recollection of the tails being incorrect, even though that is the job of the above model. In essence, you are conditioning on your data twice — once intuitively, and once as part of the model. This is bad because it assumes that you made each observation twice as many times as you actually did.

What is interesting, though, is that you can actually compute things like the probability that you incorrectly remembered an event, given the rest of the data, and it will be different from the prior probability. So in addition to a posterior estimate of $\pi$, you get posterior estimates of the likelihood of each of your recollections. Just be careful not to take these posterior estimates and use them as if they were prior estimates (which, as explained above, is what we are likely to do intuitively).
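Here is a sketch of how those posterior estimates for individual recollections could be computed under the model above. Conditional on $\theta$, a remembered heads with reliability $h_i$ was real with probability $h_i\theta / ((1-h_i)+h_i\theta)$, and a remembered tails with reliability $t_j$ was real with probability $t_j(1-\theta)/(1-t_j\theta)$; averaging over the posterior on $\theta$ gives the answer. The uniform prior, the grid, and the function name are assumptions for illustration.

```python
import numpy as np

def recollection_posteriors(h, t, grid_size=1000):
    """Posterior probability that each remembered flip was real.

    h, t: prior reliabilities of the remembered heads and tails.
    Assumes a uniform prior on theta (an illustrative choice).
    """
    h = np.asarray(h, dtype=float)[:, None]
    t = np.asarray(t, dtype=float)[:, None]
    # Midpoint grid strictly inside (0, 1), so no denominator below is zero.
    theta = (np.arange(grid_size) + 0.5) / grid_size
    heads_terms = (1 - h) + h * theta   # shape (m, grid)
    tails_terms = 1 - t * theta         # shape (n, grid)
    log_like = np.log(heads_terms).sum(axis=0) + np.log(tails_terms).sum(axis=0)
    post = np.exp(log_like - log_like.max())
    post /= post.sum()
    # Average the per-theta probability of being real over the posterior on theta.
    real_heads = (h * theta / heads_terms) @ post
    real_tails = (t * (1 - theta) / tails_terms) @ post
    return real_heads, real_tails

# 100 remembered heads and a single remembered tails, each with prior reliability 0.9.
real_heads, real_tails = recollection_posteriors([0.9] * 100, [0.9])
print(f"posterior P(the tails was real) = {real_tails[0]:.2f}")
```

In this example the lone tails should end up with a posterior probability of being real far below its prior of 0.9, which is the model doing the "explaining away" on its own.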

There are other issues to using this in practice, as well. For instance, if you really want the coin to be fair, or unfair, or be biased in a certain direction, it is very easy to fool yourself into assigning skewed probability estimates towards each of your recollections, thus ending up with a biased answer at the end. It’s not even difficult — if I take a fair coin, and underestimate my recollection of each tails by 20%, and overestimate my recollection of each heads by 20%, then all of a sudden I “have a coin” that is 50% more likely to come up heads than tails.

Fortunately, my intended application of this model will be in a less slippery domain (hopefully). The purpose is to finally answer the question I posed in the last post, which I’ll repeat here for convenience:

Suppose that you have never played a sport before, and you play soccer, and enjoy it. Now suppose instead that you have never played a sport before, and play soccer, and hate it. In the first case, you will think yourself more likely to enjoy other sports in the future, relative to in the second case. Why is this?

Or if you disagree with the premises of the above scenario, simply “If X and Y belong to the same category C, why is it that in certain cases we think it more likely that Y will have attribute A upon observing that X has attribute A?”

In the interest of making my posts shorter, I will leave that until next time, but hopefully I’ll get to it in the next week.

Nobody Understands Probability

2010-09-13T00:00:00-07:00

The goal of this post is to give an overview of Bayesian statistics as well as to correct errors about probability that even mathematically sophisticated people commonly make. Hopefully by the end of this post I will convince you that you don’t actually understand probability theory as well as you think, and that probability itself is something worth thinking about.

I will try to make this post somewhat shorter than the previous posts. As a result, this will be only the first of several posts on probability. Even though this post will be shorter, I will summarize its organization below:

Bayes’ theorem: the fundamentals of conditional probability
modeling your sources: how not to calculate conditional probabilities; the difference between “you are given X” and “you are given that you are given X”
how to build models: examples using toy problems
probabilities are statements about your beliefs (not the world)
re-evaluating a standard statistical test

I bolded the section on models because I think it is very important, so I hope that bolding it will make you more likely to read it.

Also, I should note that when I say that nobody understands probability, I don’t mean it in the sense that most people are bad at combinatorics. Indeed, I expect that most of the readers of this blog are quite proficient at combinatorics, and that many of them even have sophisticated mathematical definitions of probability. Rather I would say that actually using probability theory in practice is non-trivial. This is partially because there are some subtleties (or at least, I have found myself tripped up by certain points, and did not realize this until much later). It is also because whenever you use probability theory in practice, you end up employing various heuristics, and it’s not clear which ones are the “right” ones.

If you disagree with me, and think that everything about probability is trivial, then I challenge you to come up with a probability-theoretic explanation of the following phenomenon:

Suppose that you have never played a sport before, and you play soccer, and enjoy it. Now suppose instead that you have never played a sport before, and play soccer, and hate it. In the first case, you will think yourself more likely to enjoy other sports in the future, relative to in the second case. Why is this?

Or if you disagree with the premises of the above scenario, simply “If X and Y belong to the same category C, why is it that in certain cases we think it more likely that Y will have attribute A upon observing that X has attribute A?”

Bayes’ Theorem

Bayes’ theorem is a fundamental result about conditional probability. It says the following:

$p(A \mid B) = \frac{p(B \mid A)p(A)}{p(B)}$

Here $A$ and $B$ are two events, and $p(A \mid B)$ means the probability of $A$ conditioned on $B$. In other words, if we already know that $B$ occurred, what is the probability of $A$? The above theorem is quite easy to prove, using the fact that $p(A \cap B) = p(A \mid B)p(B)$, and thus also equals $p(B \mid A)p(A)$, so that $p(A \mid B)p(B) = p(B \mid A)p(A)$, which implies Bayes’ theorem. So, why is it useful, and how do we use it?

One example is the following famous problem: A doctor has a test for a disease that is 99% accurate. In other words, it has a 1% chance of telling you that you have a disease even if you don’t, and it has a 1% chance of telling you that you don’t have a disease even if you do. Now suppose that the disease that this tests for is extremely rare, and only affects 1 in 1,000,000 people. If the doctor performs the test on you, and it comes up positive, how likely are you to have the disease?

The answer is close to $10^{-4}$, since it is roughly $10^4$ times as likely for the test to come up positive due to an error in the test relative to you actually having the disease. To actually compute this with Bayes’ rule, you can say

p(Disease | Test is positive) = p(Test is positive | Disease) p(Disease) / p(Test is positive),

which comes out to $\frac{0.99 \cdot 10^{-6}}{0.01 \cdot (1-10^{-6}) + 0.99 \cdot 10^{-6}},$ which is quite close to $10^{-4}$.
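For readers who want to check the arithmetic, here is a small sketch of the same computation. The function and parameter names are my own; the numbers are the ones from the problem statement.

```python
def posterior_disease(prevalence, false_positive_rate, false_negative_rate):
    """p(Disease | Test is positive) via Bayes' rule."""
    true_positive_rate = 1 - false_negative_rate
    # p(Test is positive), summed over the two cases: diseased and healthy.
    p_positive = (true_positive_rate * prevalence
                  + false_positive_rate * (1 - prevalence))
    return true_positive_rate * prevalence / p_positive

posterior = posterior_disease(prevalence=1e-6,
                              false_positive_rate=0.01,
                              false_negative_rate=0.01)
print(f"p(Disease | Test is positive) = {posterior:.2e}")
```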

In general, we can use Bayes’ law to test hypotheses:

p(Hypothesis | Data) = p(Data | Hypothesis) p(Hypothesis) / p(Data)

Let’s consider each of these terms separately:

p(Hypothesis | Data): the weight we assign to a given hypothesis being correct under our observed data
p(Data | Hypothesis): the likelihood of seeing the data we saw under our hypothesis; note that this should be quite easy to compute. If it isn’t, then we haven’t yet fully specified our hypothesis.
p(Hypothesis): the prior weight we give to our hypothesis. This is subjective, but should intuitively be informed by the consideration that “simpler hypotheses are better”.
p(Data): how likely we are to see the data in the first place. This is quite hard to compute, as it involves considering all possible hypotheses, how likely each of those hypotheses is to be correct, and how likely the data is to occur under each hypothesis.

So, we have an expression for p(Hypothesis | Data) in terms of three quantities, one of which is easy to compute, another of which can be chosen subjectively, and the last of which is hard to compute. How do we get around the fact that p(Data) is hard to compute? Note that p(Data) is independent of which hypothesis we are testing, so Bayes’ theorem actually gives us a very good way for comparing the relative merits of two hypotheses:

p(Hypothesis 1 | Data) / p(Hypothesis 2 | Data) = [p(Data | Hypothesis 1) / p(Data | Hypothesis 2)] $\times$ p(Hypothesis 1) / p(Hypothesis 2)

Let’s consider the following toy example. There is a stream of digits going past us, too fast for us to tell what the numbers are. But we are allowed to push a button that will stop the stream and allow us to see a single number (whichever one is currently in front of us). We push this button three times, and see the numbers 3, 5, and 3. How many different numbers would we estimate are in the stream?

For simplicity, we will make the (somewhat unnatural) assumption that each number between 0 and 9 is selected to be in the stream with probability 0.5, and that each digit in the stream is chosen uniformly from the set of selected numbers. It is worth noting now that making this assumption, rather than some other assumption, will change our final answer.

Now under this assumption, what is the probability, say, of there being exactly 2 numbers (3 and 5) in the stream? By Bayes’ theorem, we have

$p(\{3,5\} \mid (3,5,3)) \propto p((3,5,3) \mid \{3,5\}) p(\{3,5\}) = \left(\frac{1}{2}\right)^3 \left(\frac{1}{2}\right)^{10}.$

Here $\propto$ means “is proportional to”.

What about the probability of there being 3 numbers (3, 5, and some other number)? For any given other number, this would be

$p(\{3,5,x\} \mid (3,5,3)) \propto p((3,5,3) \mid \{3,5,x\}) p(\{3,5,x\}) = \left(\frac{1}{3}\right)^3 \left(\frac{1}{2}\right)^{10}.$

However, there are 8 possibilities for $x$ above, all of which correspond to disjoint scenarios, so the probability of there being 3 numbers is proportional to $8 \left(\frac{1}{3}\right)^3 \left(\frac{1}{2}\right)^{10}$. If we compare this to the probability of there being 2 numbers, we get

$p(2 \text{ numbers} \mid (3,5,3)) / p(3 \text{ numbers} \mid (3,5,3)) = (1/2)^3 / [8 \times (1/3)^3] = 27/64$.

Even though we have only seen two numbers in our first three samples, we still think it is more likely that there are 3 numbers than 2, just because the prior likelihood of there being 3 numbers is so much higher. However, suppose that we made 6 draws, and they were 3,5,3,3,3,5. Then we would get

$p(2 \text{ numbers} \mid (3,5,3,3,3,5)) / p(3 \text{ numbers} \mid (3,5,3,3,3,5)) = (1/2)^6 / [8 \times (1/3)^6] = 729/512$.

Now we find it more likely that there are only 2 numbers. This is what tends to happen in general with Bayes’ rule — over time, more restrictive hypotheses become exponentially more likely than less restrictive hypotheses, provided that they correctly explain the data. Put another way, hypotheses that concentrate probability density towards the actual observed events will do best in the long run. This is a nice feature of Bayes’ rule because it means that, even if the prior you choose is not perfect, you can still arrive at the “correct” hypothesis through enough observations (provided that the hypothesis is among the set of hypotheses you consider).
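The digit-stream comparison can also be done by brute force, enumerating every subset of digits that contains all of the observed ones. The sketch below uses exact fractions and the same modeling assumptions as the toy example (each digit included with probability 0.5, draws uniform over the included digits); the function name is my own.

```python
from fractions import Fraction
from itertools import combinations

def posterior_weights_by_size(draws, digits=range(10)):
    """Unnormalized posterior weight for each possible number of distinct digits."""
    digits = list(digits)
    seen = set(draws)
    # Every specific subset has the same prior when each digit is included
    # independently with probability 1/2.
    prior = Fraction(1, 2) ** len(digits)
    weights = {}
    for size in range(1, len(digits) + 1):
        for subset in combinations(digits, size):
            if seen <= set(subset):  # the subset must explain every draw
                likelihood = Fraction(1, size) ** len(draws)
                weights[size] = weights.get(size, 0) + likelihood * prior
    return weights

three_draws = posterior_weights_by_size((3, 5, 3))
six_draws = posterior_weights_by_size((3, 5, 3, 3, 3, 5))
print("2 vs 3 numbers after three draws:", three_draws[2] / three_draws[3])
print("2 vs 3 numbers after six draws:  ", six_draws[2] / six_draws[3])
```

The same dictionary also holds the weights for 4 through 10 numbers, so it can be normalized to get a full posterior over the size of the stream rather than just one ratio.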

I will use Bayes’ rule extensively through the rest of this post and the next few posts, so you should make sure that you understand it. If something is unclear, post a comment and I will try to explain in more detail.

Model Your Sources

An important distinction that I think most people don’t think about is the difference between experiments you perform, and experiments you observe. To illustrate what I mean by this, I would point to the difference between biology and particle physics – where scientists set out to test a hypothesis by creating an experiment specifically designed to do so — and astrophysics and economics, where many “experiments” come from seeking out existing phenomena that can help evaluate a hypothesis.

To illustrate why one might need to be careful in the latter case, consider empirical estimates of average long-term GDP growth rate. How would one do this? Since it would be inefficient to wait around for the next 10 years and record the data of all currently existing countries, instead we go back and look at countries that kept records allowing us to compute GDP. But in this case we are only sampling from countries that kept such records, which implies a stable government as well as a reasonable degree of economics expertise within that government. So such a study almost certainly overestimates the actual average growth rate.

Or as another example, we can argue that a scientist is more likely to try to publish a paper if it doesn’t agree with prevalent theories than if it does, so looking merely at the proportion of papers that lend support to or take support away from a theory (even if weighted by the convincingness of each paper) is probably not a good way to determine the validity of a theory.

So why are we safer in the case that we forcibly gather our own data? By gathering our own data, we understand much better (although still not perfectly) the way in which it was constructed, and so there is less room for confounding parameters. In general, we would like it to be the case that the likelihood of observing something that we want to observe does not depend on anything else that we care about — or at the very least, we would like it to depend in a well-defined way.

Let’s consider an example. Suppose that a man comes up to you and says “I have two children. At least one of them is a boy.” What is the probability that they are both boys?

The standard way to solve this is as follows: Assuming that male and female children are equally likely, there is a $\frac{1}{4}$ chance of two girls or two boys, and a $\frac{1}{2}$ chance of having one girl and one boy. Now by Bayes’ theorem,

p(Two boys | At least one boy) = p(At least one boy | Two boys) $\times$ p(Two boys) / p(At least one boy) = 1 $\times$ (1/4) / (1/2+1/4) = 1/3.

So the answer should be 1/3 (if you did math contests in high school, this problem should look quite familiar). However, the answer is not, in fact, 1/3. Why is this? We were given that the man had at least one boy, and we just computed the probability that the man had two boys given that he had at least one boy using Bayes’ theorem. So what’s up? Is Bayes’ theorem wrong?

No, the answer comes from an unfortunate namespace collision in the word “given”. The man “gave” us the information that he has at least one male child. By this we mean that he asserted the statement “I have at least one male child.” Now our issue is when we confuse this with being “given” that the man has at least one male child, in the sense that we should restrict to the set of universes in which the man has at least one male child. This is a very different statement than the previous one. For instance, it rules out universes where the man has two girls, but is lying to us.

Even if we decide to ignore the possibility that the man is lying, we should note that most universes where the man has at least one son don’t even involve him informing us of this fact, and so it may be the case that proportionally more universes where the man has two boys involve him telling us “I have at least one male child”, relative to the proportion of such universes where the man has one boy and one girl. In this case the probability that he has two boys would end up being greater than 1/3.

The most accurate way to parse this scenario would be to say that we are given (restricted to the set of possible universes) that we are given (the man told us) that the man has at least one male child. The correct way to apply Bayes’ rule in this case is

p(X has two boys | X says he has $\geq$ 1 boy) = p(X says he has $\geq$ 1 boy | X has two boys) $\times$ p(X has two boys) / p(X says he has $\geq$ 1 boy)

If we further assume that the man is not lying, and that male and female children are equally likely and uncorrelated, then we get

p(X has two boys | X says he has $\geq$ 1 boy) = [p(X says he has $\geq$ 1 boy | X has two boys) $\times$ 1/4] / [p(X says he has $\geq$ 1 boy | X has two boys) $\times$ 1/4 + p(X says he has $\geq$ 1 boy | X has one boy) $\times$ 1/2]

So if we think that the man is $\alpha$ times more likely to tell us that he has at least one boy when he has two boys, then

p(X has two boys | X says he has $\geq$ 1 boy) = $\frac{\alpha}{\alpha+2}$.

Now this means that if we want to claim that the probability that the man has two boys is $\frac{1}{3}$, what we are really claiming is that he is equally likely to inform us that he has at least one boy, in all situations where it is true, independent of the actual gender distribution of his children.

I would argue that this is quite unlikely, as if he has a boy and a girl, then he could equally well have told us that he has at least one girl, whereas he couldn’t tell us that if he has only boys. So I would personally put $\alpha$ closer to 2, which yields an answer of $\frac{1}{2}$. On the other hand, situations where someone walks up to me and tells me strange facts about the gender distribution of their children are, well, strange. So I would also have to take into account the likely psychology of such a person, which would end up changing my estimate of $\alpha$.
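To see how strongly the answer depends on $\alpha$, here is a minimal Python sketch of the formula $\frac{\alpha}{\alpha+2}$, assuming (as above) that the man never lies and that the priors are 1/4 for two boys and 1/2 for one boy and one girl. The function name `posterior_two_boys` is just an illustrative choice.

```python
from fractions import Fraction


def posterior_two_boys(alpha):
    """p(two boys | man says he has >= 1 boy), assuming he never lies.

    alpha is how many times more likely he is to make the statement
    when he has two boys than when he has one boy and one girl.
    """
    alpha = Fraction(alpha)
    prior_two_boys = Fraction(1, 4)
    prior_one_boy = Fraction(1, 2)
    # Likelihoods only matter up to a common factor, so use alpha and 1.
    numerator = alpha * prior_two_boys
    denominator = alpha * prior_two_boys + 1 * prior_one_boy
    return numerator / denominator


for alpha in (1, 2, 4):
    print(alpha, posterior_two_boys(alpha))  # 1/3, then 1/2, then 2/3
```

The textbook answer of 1/3 is just the $\alpha = 1$ row; any belief that the statement is more natural for a father of two boys pushes the answer upward.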

The whole point here is that, because we were an observer receiving information, rather than an experimenter acquiring information, there are all sorts of confounding factors that are difficult to estimate, making it difficult to get a good probability estimate (more on that later). That doesn’t mean that we should give up and blindly guess $\frac{1}{3}$, though — it might feel like doing so gets away without making unwarranted assumptions, but it in fact implicitly makes the assumption that $\alpha = 1$, which as discussed above is almost certainly unwarranted.

What it does mean, though, is that, as scientists, we should try to avoid situations like the one above where there are lots of confounding factors between what we care about and our observations. In particular, we should avoid uncertainties in the source of our information by collecting the information ourselves.

I should note that, even when we construct our own experiments, we should still model the source of our information. But doing so is often much easier. In fact, if we wanted to be particularly pedantic, we really need to restrict to the set of universes in which our personal consciousness receives a particular set of stimuli, but that set of stimuli has almost perfect correlation with photons hitting our eyes, which has almost perfect correlation with the set of objects in front of us, so going to such lengths is rarely necessary — we can usually stop at the level of our personal observations, as long as we remember where they come from.

How to Build Models

Now that I’ve told you that you need to model your information sources, you perhaps care about how to do said modeling. Actually, constructing probabilistic models is an extremely important skill, so even if you ignore the rest of this post, I recommend paying attention to this section.

This section will have the following examples:

Determining if a coin is fair
Finding clusters

Determining if a Coin is Fair

Suppose that you have occasion to observe a coin being flipped (or better yet, you flip it yourself). You do this several times and observe a particular sequence of heads and tails. If you see all heads or all tails, you will probably think the coin is unfair. If you see roughly half heads and half tails, you will probably think the coin is fair. But how do we quantify such a calculation? And what if there are noticeably many more heads than tails, but not so many as to make the coin obviously unfair?

We’ll solve this problem by building up a model in parts. First, there is the thing we care about, namely whether the coin is fair or unfair. So we will construct a random variable X that can take the values Fair and Unfair. Then p(X = Fair) is the probability we assign to a generic coin being fair, and p(X = Unfair) is the probability we assign to a generic coin being unfair.

Now supposing the coin is fair, what do we expect? We expect each flip of the coin to be independent, and have a $\frac{1}{2}$ probability of coming up heads. So if we let F1, F2, …, Fn be the flips of the coin, then p(Fi = Heads | X = Fair) = 0.5.

What if the coin is unfair? Let’s go ahead and blindly assume that the flips will still be independent, and furthermore that each possible weight of the coin is equally likely (this is unrealistic, as weights near 0 or 1 are much more likely than weights near 0.5). Then we have to have an extra variable $\theta$, the probability that the unfair coin comes up heads. So we have p(Unfair coin weight = $\theta$) = 1. Note that this is a probability density, not an actual probability (as opposed to p(Fi = Heads | X = Fair), which was a probability).

Continuing, if F1, F2, …, Fn are the flips of the coin, then p(Fi = Heads | X = Unfair, Weight = $\theta$) = $\theta$.

Now we’ve set up a model for this problem. How do we actually calculate a posterior probability of the coin being fair for a given sequence of heads and tails? (A posterior probability is just the technical term for the conditional probability of a hypothesis given a set of data; this is to distinguish it from the prior probability of the hypothesis before seeing any data.)

Well, we’ll still just use Bayes’ rule:

p(Fair | F1, …, Fn) $\propto$ p(F1, …, Fn | Fair) p(Fair) = $\left(\frac{1}{2}\right)^n$ p(Fair)

p(Unfair | F1, …, Fn) $\propto$ p(F1, …, Fn | Unfair) p(Unfair) = $\int_{0}^{1} \theta^{H}(1-\theta)^{T} d\theta$ p(Unfair)

Here H is the number of heads and T is the number of tails. In this case we can fortunately actually compute the integral in question and see that it is equal to $\frac{H!T!}{(H+T+1)!}$. So we get that

p(Fair | F1, …, Fn) / p(Unfair | F1, …, Fn) = p(Fair)/p(Unfair) $\times$ $\frac{(H+T+1)!}{2^n H!T!}$.
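As a quick check of the odds formula, here is a small Python sketch that evaluates it exactly. The function name `fair_vs_unfair_odds` and the default prior of 1/2 are my own illustrative choices, not part of the model above.

```python
from fractions import Fraction
from math import factorial


def fair_vs_unfair_odds(heads, tails, prior_fair=Fraction(1, 2)):
    """Posterior odds p(Fair | flips) / p(Unfair | flips)."""
    n = heads + tails
    prior_fair = Fraction(prior_fair)
    prior_odds = prior_fair / (1 - prior_fair)
    # Likelihood under Fair: every sequence of n flips has probability (1/2)^n.
    likelihood_fair = Fraction(1, 2 ** n)
    # Likelihood under Unfair: the integral of theta^H (1-theta)^T over [0, 1].
    likelihood_unfair = Fraction(factorial(heads) * factorial(tails),
                                 factorial(n + 1))
    return prior_odds * likelihood_fair / likelihood_unfair


print(fair_vs_unfair_odds(6, 6))    # 3003/1024, so the data favor Fair about 3 to 1
print(fair_vs_unfair_odds(12, 0))   # 13/4096, so the data strongly favor Unfair
print(fair_vs_unfair_odds(12, 0, prior_fair=Fraction(99, 100)))
```

Note that even twelve heads in a row does not overwhelm a prior of 99 fair coins in 100: the last line comes out to 1287/4096, roughly 3 to 1 in favor of Unfair.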

It is often useful to draw a diagram of our model to help keep track of it:

Now suppose that we, being specialists in determining if coins are fair, have been called in to study a large collection of coins. We get to one of the coins in the collection, flip it several times, and observe the following sequence of heads and tails:

HHHHTTTHHTTT

Since there are an equal number of heads and tails, our previous analysis will certainly conclude that the coin is fair, but its behavior does seem rather suspicious. In particular, different flips don’t look like they are really independent, so perhaps our previous model is wrong. Maybe the right model is one where the next coin value is usually the same as the previous coin value, but flips with some probability. Now we have a new value of X, which we’ll call Weird, and a parameter $\phi$ (basically the same as $\theta$) that tells us how likely a weird coin is to have a given probability of switching. We’ll again give $\phi$ a uniform distribution over [0,1], so p(Switching probability of weird coin = $\phi$) = 1.

To predict the actual coin flips, we get p(F1 = Heads | X = Weird, Switching probability = $\phi$) = $\frac{1}{2}$, p(F(i+1) = Heads | Fi = Heads, X = Weird, Switching probability = $\phi$) = $1-\phi$, and p(F(i+1) = Heads | Fi = Tails, X = Weird, Switching probability = $\phi$) = $\phi$. We can represent this all with the following graphical model:

Now we are ready to evaluate whether the coin we saw was a Weird coin or not.

p(X = Weird | HHHHTTTHHTTT) $\propto$ p(HHHHTTTHHTTT | X = Weird) p(X = Weird) = $\int_{0}^{1} \frac{1}{2}(1-\phi)^8 \phi^3 d\phi$ p(X = Weird)

Evaluating that integral gives $\frac{8!3!}{2 \cdot 12!} = \frac{1}{3960}$. So p(X = Weird | Data) $\propto$ p(X = Weird) / 3960, compared to p(X = Fair | Data) $\propto$ p(X = Fair) / 4096. In other words, positing a Weird coin only explains the data slightly better than positing a Fair coin, and since the vast majority of coins we encounter are fair, it is quite likely that this one is, as well.
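The two likelihoods can be reproduced with a few lines of Python. This is a sketch under the model's assumptions: the first flip of a Weird coin is heads with probability 1/2 (matching the factor of $\frac{1}{2}$ in the integral), and $\phi$ is uniform on [0,1]. The function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def weird_likelihood(flips):
    """p(flips | X = Weird), with the switching probability phi integrated out."""
    switches = sum(a != b for a, b in zip(flips, flips[1:]))
    stays = len(flips) - 1 - switches
    # The first flip contributes 1/2; the remaining flips give the integral
    # of (1 - phi)^stays * phi^switches over [0, 1], which equals
    # stays! * switches! / (stays + switches + 1)!.
    beta_integral = Fraction(factorial(stays) * factorial(switches),
                             factorial(stays + switches + 1))
    return Fraction(1, 2) * beta_integral


def fair_likelihood(flips):
    """p(flips | X = Fair): every sequence has probability (1/2)^n."""
    return Fraction(1, 2 ** len(flips))


data = "HHHHTTTHHTTT"
print(weird_likelihood(data))  # 1/3960
print(fair_likelihood(data))   # 1/4096
```

The sequence has 3 switches and 8 non-switches among its 11 transitions, which is where the exponents 3 and 8 in the integral come from.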

Note: I’d like to draw your attention to a particular subtlety here. Note that I referred to, for instance, “Probability that an unfair coin weight is $\theta$”, as opposed to “Probability that a coin weight is $\theta$ given that it is unfair”. This really is an important distinction, because the distribution over $\theta$ really is the probability distribution over the weights of a generic unfair coin, and this distribution doesn’t change based on whether our current coin happens to be fair or unfair. Of course, we can still condition on our coin being fair or unfair, but that won’t change the probability distribution over $\theta$ one bit.

Finding Clusters

Now let’s suppose that we have a bunch of points (for simplicity, we’ll say in two-dimensional Euclidean space). We would like to group the points into a collection of clusters. Let’s also go ahead and assume that we know in advance that there are $k$ clusters. How do we actually find those clusters?

We’ll make the further heuristic assumption that clusters tend to arise from a “true” version of the cluster, and some Gaussian deviation from that true version. So in other words, if we let there be k means for our clusters, $\mu_1, \mu_2, \ldots, \mu_k$, and multivariate Gaussians about their means with covariance matrices $\Sigma_1, \Sigma_2, \ldots, \Sigma_k$, and finally assume that the probability that a point belongs to cluster i is $\rho_i$, then the probability of a set of points $\vec{x_1}, \vec{x_2}, \ldots, \vec{x_n}$ is

$W_{\mu,\Sigma,\rho}(\vec{x}) := \prod_{i=1}^n \sum_{j=1}^k \frac{\rho_j}{2\pi \sqrt{\det(\Sigma_j)}} e^{-\frac{1}{2}(\vec{x_i}-\mu_j)^T \Sigma_j^{-1} (\vec{x_i}-\mu_j)}$

From this, once we pick probability distributions over the $\Sigma$, $\mu$, and $\rho$, we can calculate the posterior probability of a given set of clusters as

$p(\Sigma, \mu, \rho \mid \vec{x}) \propto p(\Sigma) p(\mu) p(\rho) W_{\mu,\Sigma,\rho}(\vec{x})$

This corresponds to the following graphical model:

Note that once we have a set of clusters, we can also determine the probability that a given point belongs to each cluster:

p($\vec{x}$ belongs to cluster $(\Sigma, \mu, \rho)$) $\propto$ $\frac{\rho}{2\pi \sqrt{\det(\Sigma)}} e^{-\frac{1}{2}(\vec{x}-\mu)^T \Sigma^{-1} (\vec{x}-\mu)}$.

You might notice, though, that in this case it is much less straightforward to actually find clusters with high posterior probability (as opposed to in the previous case, where it was quite easy to distinguish between Fair, Unfair, and Weird, and furthermore to figure out the most likely values of $\theta$ and $\phi$). One reason why is that, in the previous case, we really only needed to make one-dimensional searches over $\theta$ and $\phi$ to figure out what the most likely values were. In this case, we need to search over all of the $\Sigma_i$, $\mu_i$, and $\rho_i$ simultaneously, which gives us, essentially, a $3k-1$-dimensional search problem, which becomes exponentially hard quite quickly.

This brings us to an important point, which is that, even if we write down a model, searching over that model can be difficult. So in addition to the model, I will go over a good algorithm for finding the clusters from this model, known as the EM algorithm. For the version of the EM algorithm described below, I will assume that we have uniform priors over $\Sigma_i$, $\mu_i$, and $\rho_i$ (in the last case, we have to do this by picking a set of un-normalized $\rho_i$ uniformly over $\mathbb{R}^k$ and then normalizing). We’ll ignore the problem that it is not clear how to define a uniform distribution over a non-compact space.

The way the EM algorithm works is that we start by initializing $\Sigma_i,$ $\mu_i$, and $\rho_i$ arbitrarily. Then, given these values, we compute the probability that each point belongs to each cluster. Once we have these probabilities, we re-compute the maximum-likelihood values of the $\mu_i$ (as the expected mean of each cluster given how likely each point is to belong to it). Then we find the maximum-likelihood values of the $\Sigma_i$ (as the expected covariance relative to the means we just found). Finally, we find the maximum-likelihood values of the $\rho_i$ (as the expected portion of points that belong to each cluster). We then repeat this until converging on an answer.
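The loop just described can be sketched in plain Python for two-dimensional points. This is only an illustration, not a tuned implementation: the function names, the fixed starting values, the synthetic data, and the small ridge added to each covariance are all my own assumptions, and the density is the standard two-dimensional Gaussian with normalizer $2\pi\sqrt{\det(\Sigma)}$.

```python
import math
import random


def gaussian_pdf(x, mu, cov):
    """Density of a 2-D Gaussian with mean mu and covariance ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using the explicit 2x2 inverse.
    quad = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def em_step(points, mus, covs, rhos):
    """One EM iteration: soft assignments, then updated means, covariances, weights."""
    k, n = len(mus), len(points)
    # Probability that each point belongs to each cluster.
    resp = []
    for x in points:
        w = [rhos[j] * gaussian_pdf(x, mus[j], covs[j]) for j in range(k)]
        total = sum(w)
        resp.append([wj / total for wj in w])
    new_mus, new_covs, new_rhos = [], [], []
    eps = 1e-6  # small ridge (an added assumption) to keep covariances invertible
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mx = sum(r[j] * x[0] for r, x in zip(resp, points)) / nj
        my = sum(r[j] * x[1] for r, x in zip(resp, points)) / nj
        sxx = sum(r[j] * (x[0] - mx) ** 2 for r, x in zip(resp, points)) / nj
        syy = sum(r[j] * (x[1] - my) ** 2 for r, x in zip(resp, points)) / nj
        sxy = sum(r[j] * (x[0] - mx) * (x[1] - my)
                  for r, x in zip(resp, points)) / nj
        new_mus.append((mx, my))
        new_covs.append(((sxx + eps, sxy), (sxy, syy + eps)))
        new_rhos.append(nj / n)
    return new_mus, new_covs, new_rhos


# Synthetic data: two clusters centred at (0, 0) and (5, 5).
random.seed(0)
points = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)] +
          [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(200)])
mus = [(1.0, 1.0), (4.0, 4.0)]
covs = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 0.0), (0.0, 1.0))]
rhos = [0.5, 0.5]
for _ in range(30):
    mus, covs, rhos = em_step(points, mus, covs, rhos)
print(mus, rhos)
```

With clusters this well separated, the estimated means should land close to the true centres; with overlapping clusters or a poor starting point, EM can settle on a worse local optimum.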

For a visualization of how the EM algorithm actually works, and a more detailed description of the two steps, I recommend taking a look at Josh Tenenbaum’s lecture notes starting at slide 38.

The Mind Projection Fallacy

This is perhaps a nitpicky point, but I have found that keeping it in mind has led me to better understanding what I am doing, or at least to ask interesting questions.

The point here is that people often intuitively think of probabilities as a fact about the world, when in reality probabilities are a fact about our model of the world. For instance, one might say that the probability of a child being male versus female is 0.5. And perhaps this is a good thing to say in a generic case. But we also have a much better model of gender, and we know that it is based on X and Y chromosomes. If we could look at a newly conceived ball of cells in a mother’s womb, and read off the chromosomes, then we could say with near certainty whether the child would end up being male or female.

You could also argue that I can empirically measure the probability that a person is male or female, by counting up all the people ever, and looking at the proportion of males and females. But this runs into two issues — first of all, the portion of males will be slightly off of 0.5. So how do we justify just randomly rounding off to 0.5? Or do we not?

Second of all, you can do this all you want, but it doesn’t give me any reason why I should take this information, and use it to form a conjecture about how likely the next person I meet is to be male or female. Once we do that, we are taking into account my model of the world.

Statistics

This final section seeks to look at a result from classical statistics and re-interpret it in a Bayesian framework.

In particular, I’d like to consider the following strategy for rejecting a hypothesis. In abstract terms, it says that, if we have a random variable Data’ that consists of re-drawing our data assuming that our hypothesis is correct, then

$p(\text{Hypothesis}) < p(p(\text{Data}' \mid \text{Hypothesis}) \leq p(\text{Data} \mid \text{Hypothesis}))$

In other words, suppose that the probability of drawing data less likely (under our hypothesis) than the data we actually saw is less than $\alpha$. Then the likelihood of our hypothesis is at most $\alpha$.

Or actually, this is not quite true. But it is true that there is an algorithm that will only reject correct hypotheses with probability $\alpha$, and this algorithm is to reject a hypothesis when p(p(Data’ | Hypothesis) $\leq$ p(Data | Hypothesis)) < $\alpha$. I will leave the proof of this to you, as it is quite easy.

To illustrate this example, let’s suppose (as in a previous section) that we have a coin and would like to determine whether it is fair. In the above method, we would flip it many times, and record the number H of heads. If there is less than an $\alpha$ chance of coming up with a less likely number of heads than H, then we can reject the hypothesis that the coin is fair with confidence $1-\alpha$. For instance, if there are 80 total flips, and H = 25, then we would calculate

$\alpha = \frac{1}{2^{80}} \left(\sum_{k=0}^{25} \binom{80}{k} + \sum_{k=55}^{80} \binom{80}{k} \right)$.
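The sum above is easy to evaluate exactly. Here is a minimal Python sketch; the function name `two_sided_tail` is an illustrative choice, and the special case for a perfectly balanced count is my own addition to avoid counting the middle term twice.

```python
from fractions import Fraction
from math import comb


def two_sided_tail(n, h):
    """For a fair coin, probability of a head count at least as far from n/2 as h."""
    lo, hi = min(h, n - h), max(h, n - h)
    if lo == hi:
        # h is exactly n/2, so every outcome is at least this far from n/2.
        return Fraction(1)
    count = (sum(comb(n, k) for k in range(0, lo + 1)) +
             sum(comb(n, k) for k in range(hi, n + 1)))
    return Fraction(count, 2 ** n)


alpha = two_sided_tail(80, 25)
print(float(alpha))
```

For 80 flips and 25 heads this comes out on the order of $10^{-3}$, so the fair-coin hypothesis would be rejected at any of the usual thresholds.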

So this seems like a pretty good test, especially if we choose $\alpha$ to be extremely small (e.g., $10^{-100}$ or so). But the mere fact that we reject good hypotheses with probability less than $\alpha$ is not by itself helpful: a test that never rejects anything has that property too. What we really want is to also reject bad hypotheses with a reasonably large probability. I think you can get around this by repeating the same experiment many times, though.

Of course, Bayesian statistics also can’t ever say that a hypothesis is good, but when given two hypotheses it will always say which one is better. On the other hand, Bayesian statistics has the downside that it is extremely aggressive at making inferences. It will always output an answer, even if it really doesn’t have enough data to arrive at that answer confidently.

Least Squares and Fourier Analysis

2010-08-22T00:00:00-07:00

I ended my last post on a somewhat dire note, claiming that least squares can do pretty terribly when fitting data. It turns out that things aren’t quite as bad as I thought, but most likely worse than you would expect.

The theme of this post is going to be things you use all the time (or at least, would use all the time if you were an electrical engineer), but probably haven’t ever thought deeply about. I’m going to include a combination of mathematical proofs and matlab demonstrations, so there should hopefully be something here for everyone.

My first topic is going to be, as promised, least squares curve fitting. I’ll start by talking about situations when it can fail, and also about situations when it is “optimal” in some well-defined sense. To do that, I’ll have to use some Fourier analysis, which will present a good opportunity to go over when frequency-domain methods can be very useful, when they can fail, and what you can try to do when they fail.

When Least Squares Fails

To start, I’m going to do a simple matlab experiment. I encourage you to follow along if you have matlab (if you have MIT certificates you can get it for free at http://matlab.mit.edu/).

Let’s pretend we have some simple discrete-time process, y(n+1) = a y(n) + b u(n), where y is the variable we care about and u is some input signal. We’ll pick a = 0.8, b = 1.0 for our purposes, and u is chosen to be a discrete version of a random walk. The code below generates the y signal, then uses least squares to recover a and b. (I recommend taking advantage of cell mode if you’re typing this in yourself.)

> %% generate data
> a = 0.8; b = 1.0;   % true parameters of y(n+1) = a*y(n) + b*u(n)
> N = 1000;
> ntape = 1:N; y = zeros(N,1); u = zeros(N-1,1);
> % input signal: switches between 0 and 1 with probability 0.02 per step
> for n=1:N-2
>     if rand < 0.02
>         u(n+1) = 1-u(n);
>     else
>         u(n+1) = u(n);
>     end
> end
> % simulate the process
> for n=1:N-1
>     y(n+1) = a*y(n)+b*u(n);
> end
> plot(ntape,y);
>
> %% least squares fit (map y(n) and u(n) to y(n+1))
> A = [y(1:end-1) u]; b = y(2:end);   % b is reused here as the right-hand side
> params = A\b;
> afit = params(1)
> bfit = params(2)

The results are hardly surprising (you get afit = 0.8, bfit = 1.0). For the benefit of those without matlab, here is a plot of y against n:

Now let’s add some noise to the signal. The code below generates noise whose size is about 6% of the size of the data (in the sense of L2 norm).

> %%
> yn = y + 0.2*randn(N,1); % gaussian noise with standard deviation 0.2
> A = [yn(1:end-1) u]; b = yn(2:end);
> params = A\b;
> afit = params(1)
> bfit = params(2)

This time the results are much worse: afit = 0.7748, bfit = 1.1135. You might be tempted to say that this isn’t so much worse than we might expect – the accuracy of our parameters is roughly the accuracy of our data. The problem is that, if you keep running the code above (which will generate new noise each time), you will always end up with afit close to 0.77 and bfit close to 1.15. In other words, the parameters are systematically biased by the noise. Also, we should expect our accuracy to increase with more samples, but that isn’t the case here. If we change N to 100,000, we get afit = 0.7716, bfit = 1.1298. More samples will decrease the standard deviation of our answer (running the code multiple times will yield increasingly similar results), but not necessarily its correctness.
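For readers without matlab, the same experiment can be sketched in plain Python. This is a rough translation under my own assumptions: the function name `simulate_fit` is made up, the seed is fixed for reproducibility, the noise standard deviation is taken to be 0.2, and the two-parameter least squares problem is solved directly through its 2-by-2 normal equations rather than with a library routine.

```python
import random


def simulate_fit(n_samples, noise_std, a=0.8, b=1.0, seed=0):
    """Least-squares estimate of (a, b) from noisy readings of y."""
    rng = random.Random(seed)
    # Input signal: switches between 0 and 1 with probability 0.02 per step.
    u = [0.0] * (n_samples - 1)
    for n in range(n_samples - 2):
        u[n + 1] = 1 - u[n] if rng.random() < 0.02 else u[n]
    # Noise-free process, then noisy readings of it.
    y = [0.0] * n_samples
    for n in range(n_samples - 1):
        y[n + 1] = a * y[n] + b * u[n]
    yn = [v + rng.gauss(0, noise_std) for v in y]
    # Normal equations for the design matrix with columns yn(n) and u(n).
    m = n_samples - 1
    s_yy = sum(yn[i] * yn[i] for i in range(m))
    s_yu = sum(yn[i] * u[i] for i in range(m))
    s_uu = sum(u[i] * u[i] for i in range(m))
    t_y = sum(yn[i] * yn[i + 1] for i in range(m))
    t_u = sum(u[i] * yn[i + 1] for i in range(m))
    det = s_yy * s_uu - s_yu * s_yu
    a_fit = (s_uu * t_y - s_yu * t_u) / det
    b_fit = (s_yy * t_u - s_yu * t_y) / det
    return a_fit, b_fit


print(simulate_fit(20000, 0.0))  # recovers a and b essentially exactly
print(simulate_fit(20000, 0.2))  # a is pulled below 0.8 and b above 1.0
```

Changing the seed changes the noise but not the direction of the error, which is the point of the paragraph above: the estimate is biased, not merely noisy.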

A more dire way of thinking about this is that increasing the number of samples will increase how “certain” we are of our answer, but it won’t change the fact that our answer is wrong. So we will end up being quite certain of an incorrect answer.

Why does this happen? It turns out that when we use least squares, we are making certain assumptions about the structure of our noise, and those assumptions don’t hold in the example above. In particular, in a model like the one above, least squares assumes that all noise is process noise, meaning that noise at one step gets propagated to future steps. Such noise might come from a system with unmodelled friction or some external physical disturbance. In contrast, the noise we have is output noise, meaning that the reading of our signal is slightly off. What the above example shows is that a model constructed via least squares will be systematically biased by output noise.

That’s the intuition; now let’s get into the math. When we do least squares, we are trying to solve some equation Ax=b for x, where A and b are both noisy. So we really have something like A+An and b+bn, where An and bn are the noise on A and b.

Before we continue, I think it’s best to stop and think about what we really want. So what is it that we actually want? We observe a bunch of data as input, and some more data as output. We would like a way of predicting, given the input, what the output should be. In this sense, then, the distinction between “input noise” (An) and “output noise” (bn) is meaningless, as we don’t get to see either and all they do is cause b to not be exactly Ax. (If we start with assumptions on what noise “looks like”, then distinguishing between different sources of noise turns out to be actually useful. More on that later.)

If the above paragraph isn’t satisfying, then we can use the more algebraic explanation that the noise An and bn induces a single random variable on the relationship between observed input and observed output. In fact, if we let A’=A+An then we end up fitting $A’x = b+(b_n-A_nx)$, so we can just define $e = b_n-A_nx$ and have a single noise term.

Now, back to least squares. Least squares tries to minimize $\|Ax-b\|_2^2$, that is, the squared error in the $l^2$ norm. If we instead have a noisy $b$, then we are trying to minimize $\|Ax-b-e\|_2^2$, which will happen when $x$ satisfies $A^TAx = A^T(b+e)$.

If there actually exists an $\hat{x}$ such that $A\hat{x} = b$ (which is what we are positing, subject to some error term), then minimizing $\|Ax-b\|_2^2$ is achieved by setting $x$ to $\hat{x}$. Note that $\hat{x}$ is what we would like to recover. So $\hat{x}$ would be the solution to $A^TAx = A^Tb$, and thus we see that an error $e$ introduces a linear error in our estimate of $x$. (To be precise, the error affects our answer for $x$ via the operator $(A^TA)^{-1}A^T$.)

Now all this can be seen relatively easily by just using the standard formula for the solution to least squares: $x = (A^TA)^{-1}A^Tb$. But I find that it is easy to get confused about what exactly the “true” answer is when you are fitting data, so I wanted to go through each step carefully.

At any rate, we have a formula for how the error $e$ affects our estimate of $x$, now I think there are two major questions to answer:

In what way can $e$ systematically bias our estimate for $x$?
What can we say about the variance on our estimate for $x$?

To calculate the bias on $x$, we need to calculate $\mathbb{E}((A^TA)^{-1}A^Te)$, where $\mathbb{E}$ stands for expected value. If we treat $A^TA$ as fixed, linearity of expectation makes this the same as $(A^TA)^{-1}\mathbb{E}(A^Te)$, and since $(A^TA)^{-1}$ is invertible, the bias vanishes exactly when $\mathbb{E}(A^Te) = 0$. In particular, we will get an unbiased estimate exactly when $A$ and $e$ are uncorrelated. Most importantly, when we have noise on our inputs then $A$ and $e$ will (probably) be correlated, and we won’t get an unbiased result.

How bad is the bias? Well, if A actually has a noise component (i.e. $A=A_0+A_n$), and e is $b_n-A_nx$, and we assume that our noise is uncorrelated with the constant matrix $A_0$, then we get a correlation matrix equal to $A_n^T(b_n-A_nx)$, which, assuming that $A_n$ and $b_n$ are uncorrelated, gives us $-A_n^TA_nx$ in expectation. The overall bias then comes out to $-(A^TA)^{-1}\mathbb{E}(A_n^TA_n)x$.

I unfortunately don’t have as nice of an expression for the variance, although you can of course calculate it in terms of $A, b, x, A_n$, and $b_n$.

At any rate, if noise doesn’t show up in the input, and the noise that does show up is uncorrelated with the input, then we should end up with no bias. But if either of those conditions fails, we will end up with bias. When modelling a dynamical system, input noise corresponds to measurement noise (your sensors are imperfect), while output noise corresponds to process noise (the system doesn’t behave exactly as expected).

One way we can see how noise being correlated with $A$ can lead to bias is if our “noise” is actually an unmodelled quadratic term. Imagine trying to fit a line to a parabola. You won’t actually fit the tangent line to the parabola, instead you’ll probably end up fitting something that looks like a secant. However, the exact slope of the line you pick will depend pretty strongly on the distribution of points you sample along the parabola. Depending on what you want the linear model for, this could either be fine (as long as you sample a distribution of points that matches the distribution of situations that you think you’ll end up using the model for), or very annoying (if you really wanted the tangent).

If you’re actually just dealing with a parabola, then you can still get the tangent by sampling symmetrically about the point you care about, but once you get to a cubic this is no longer the case.

As a final note, one reasonable way (although I’m not convinced it’s the best, or even a particularly robust way) of determining if a linear fit of your data is likely to return something meaningful is to look at the condition number of your matrix, which can be computed in matlab using the cond function and can also be realized as the square root of the ratio of the largest to the smallest eigenvalue of $A^TA$. Note that the condition number says nothing about whether your data has a reasonable linear fit (it can’t, since it doesn’t take $b$ into account). Rather, it is a measure of how well-defined the coefficients of such a fit would be. In particular, it will be large if your data is close to lying on a lower-dimensional subspace (which can end up really screwing up your fit). In this case, you either need to collect better data or figure out why your data lies on a lower-dimensional subspace (it could be that there is some additional structure to your system that you didn’t think about; see point (3) below about a system that is heavily damped).
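To make the eigenvalue characterization concrete, here is a small Python sketch for a matrix with two columns, where the eigenvalues of $A^TA$ have a closed form. The function name `condition_number` and the example data are illustrative assumptions; for real work you would use matlab's cond or an equivalent library routine.

```python
import math


def condition_number(rows):
    """Condition number of an n-by-2 matrix given as a list of (col1, col2) rows."""
    p = sum(r[0] * r[0] for r in rows)
    q = sum(r[0] * r[1] for r in rows)
    s = sum(r[1] * r[1] for r in rows)
    # Eigenvalues of the symmetric 2x2 matrix A^T A = [[p, q], [q, s]].
    mean = (p + s) / 2
    radius = math.sqrt(((p - s) / 2) ** 2 + q * q)
    largest, smallest = mean + radius, mean - radius
    if smallest <= 0:
        return math.inf  # columns are (numerically) linearly dependent
    return math.sqrt(largest / smallest)


print(condition_number([(1, 0), (0, 1)]))  # 1.0: orthonormal columns
print(condition_number([(1, 0), (0, 2)]))  # 2.0
# Columns that are nearly proportional give a very large condition number.
print(condition_number([(1, 1.001), (2, 1.999), (3, 3.001)]))
```

The last example is the situation described above: the data lie almost on a line through the origin, so the coefficients of a linear fit are poorly determined.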

I originally wanted to write down a lot more about specific ways that noise can come into the picture, but I haven’t worked it all out myself, and it’s probably too ambitious a project for a single blog post anyways. So instead I’m going to leave you with a bunch of things to think about. I know the answers to some of these, for others I have ideas, and for others I’m still trying to work out a good answer.

(1) Can anything be done to deal with measurement noise? In particular, can anything be done to deal with the sort of noise that comes from encoders (i.e., a discretization of the signal)?
(2) Is there a good way of measuring when noise will be problematic to our fit?
(3) How can we fit models to systems that evolve on multiple time-scales? For example, an extremely damped system such as $\dot{x}_1 = x_2$, $\dot{x}_2 = -cx_1-Mx_2$, where $M \gg c$. You could take, for example, $M = 20$, $c = 1$, in which case the system behaves almost identically to the system $\dot{x}_1 = \frac{-M + \sqrt{M^2-4c}}{2} x_1$ with $x_2$ set to the derivative of $x_1$. Then the data will all lie almost on a line, which can end up screwing up your fit. So in what exact ways can your fit get screwed up, and what can be done to deal with it? (This is essentially the problem that I’m working on right now.)
(4) Is there a way to defend against non-linearities in a system messing up our fit? Can we figure out when these non-linearities occur, and to what extent?
(5) What problems might arise when we try to fit a system that is unstable or only slightly stable, and what is a good strategy for modelling such a system?

When Least Squares Works

Now that I’ve convinced you that least squares can run into problems, let’s talk about when it can do well.

As Paul Christiano pointed out to me, when you have some system where you can actually give it inputs and measure the outputs, least squares is likely to do a fairly good job. This is because you can (in principle) draw the data you use to fit your model from the same distribution as you expect to encounter when the model is used in practice. However, you will still run into the problem that failure to measure the input accurately introduces biases. And no, these biases can’t be eradicated completely by averaging the result across many samples, because the bias is always a negative definite matrix applied to $x$ (the parameters we are trying to find), and any convex combination of negative definite matrices will remain negative definite.

Intuitively, what this says is that if you can’t trust your input, then you shouldn’t rely on it strongly as a predictor. Unfortunately, the only way that a linear model knows how to trust something less is by making the coefficient on that quantity “smaller” in some sense (in the negative definite sense here). So really the issue is that least squares is too “dumb” to deal with the issue of measurement error on the input.

But I said that I’d give examples of when least squares works, and here I am telling you more about why it fails. One powerful and unexpected aspect of least squares is that it can fit a wide variety of non-linear models, as long as the model is linear in its parameters. For example, if we have a system $y = c_1+c_2x+c_3x^2+c_4\cos(x)$, then we just form a matrix $A = \left[ \begin{array}{cccc} 1 & x & x^2 & \cos(x) \end{array} \right]$ and $b = y$, where for example $\cos(x)$ is actually a column vector where the $i$th row is the cosine of the $i$th piece of input data. Models that are linear in their parameters will often arise in physical systems, and I think this is always the case for systems solved via Newton’s laws (although you might have to consolidate parameters, for example fitting both $mgl$ and $ml^2$ in the case of a pendulum). This isn’t necessarily the case for reduced models of complicated systems, for example the sort of models used for fluid dynamics. However, I think that the fact that linear fitting techniques can be applied to such a rich class of systems is quite amazing.
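Here is a minimal Python sketch of fitting the model $y = c_1+c_2x+c_3x^2+c_4\cos(x)$ by least squares. Everything specific in it is an assumption for illustration: the helper names, the "true" coefficients, the noise-free synthetic data, and the choice to solve the normal equations $A^TAc = A^Ty$ with a hand-written Gaussian elimination instead of a library solver.

```python
import math


def solve_linear_system(m, v):
    """Solve m z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def fit_basis(xs, ys, basis):
    """Least-squares coefficients for y ~ sum_j c_j * basis[j](x)."""
    # Each row of the design matrix A holds the basis functions evaluated at one x.
    rows = [[f(x) for f in basis] for x in xs]
    k = len(basis)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(ata, aty)


basis = [lambda x: 1.0, lambda x: x, lambda x: x * x, math.cos]
true_c = [2.0, -1.0, 0.5, 3.0]  # made-up coefficients for the demonstration
xs = [0.1 * i for i in range(-50, 51)]
ys = [sum(c * f(x) for c, f in zip(true_c, basis)) for x in xs]
coeffs = fit_basis(xs, ys, basis)
print(coeffs)
```

The model is non-linear in $x$ but linear in the $c_j$, which is all that least squares needs. Forming $A^TA$ explicitly squares the condition number, so a QR-based solver is usually preferable when the basis functions are nearly dependent.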

There is also a place where least squares not only works but is in some sense optimal: detecting the frequency response of a system. Actually, it is only optimal in certain situations, but even outside of those situations it has many advantages over a standard discrete Fourier transform. To get into the applications of least squares here, I’m going to have to take a detour into Fourier analysis.

Fourier Analysis

If you already know Fourier analysis, you can probably skip most of this section (although I recommend reading the last two paragraphs).

Suppose that we have a sequence of $N$ samples of a signal at equally spaced points in time. Call this sequence $x_1$, $x_2$, $\ldots$, $x_N$. We can think of this as a function $f : \{0,1,\ldots,N-1\} \to \mathbb{R}$, or, more accurately, $f : \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} \to \mathbb{R}$. For reasons that will become apparent later, we will actually think of this as a function $f : \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} \to \mathbb{C}$.

This function is part of the vector space of all functions from $\{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\}$ to $\mathbb{C}$. One can show that the functions on $\{0,\Delta t,\ldots,(N-1)\Delta t\}$ defined by

$f_k(x) = e^{\frac{2\pi i k x}{N \Delta t}},$

with $k$ ranging from $0$ to $N-1$, are all orthogonal to each other, and thus form a basis for the space of all functions from $\{0,\Delta t,2\Delta t,\ldots,(N-1)\Delta t\}$ to $\mathbb{C}$ (now it is important to use $\mathbb{C}$ since the $f_k$ take on complex values). It follows that our function $f$ can be written uniquely in the form $f(x) = \sum_{k=0}^{N-1} c_kf_k(x)$, where the $c_k$ are constants. Because of this, we can associate with each $f$ a function $\hat{f} : \{0,\frac{2 \pi}{N \Delta t},\frac{4\pi}{N\Delta t},\ldots,\frac{2(N-1)\pi}{N\Delta t}\} \to \mathbb{C}$ given by $\hat{f}(\frac{2\pi k}{N \Delta t}) := c_k$.

An intuitive way of thinking about this is that any function can be uniquely decomposed as a superposition of complex exponential functions at different frequencies. The function $\hat{f}$ is a measure of the component of $f$ at each of these frequencies. We refer to $\hat{f}$ as the Fourier transform of $f$.

While there’s a lot more that could be said on this, and I’m tempted to re-summarize all of the major results in Fourier analysis, I’m going to refrain from doing so because there are plenty of texts on it and you can probably get the relevant information (such as how to compute the Fourier coefficients, the inverse Fourier transform, etc.) from those. In fact, you could start by checking out Wikipedia’s article. It is also worth noting that the Fourier transform can be computed in $O(N\log N)$ time using any one of many “fast Fourier transform” algorithms (fft in MATLAB).
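For readers who want to check the decomposition numerically, here is a small NumPy sketch. It builds the basis functions $f_k$ explicitly, confirms they are orthogonal, and compares the projection coefficients $c_k$ with the output of `np.fft.fft`. The values of $N$ and $\Delta t$ are arbitrary, and note that NumPy's forward transform leaves out the $\frac{1}{N}$ factor, so it is divided out by hand.

```python
import numpy as np

N, dt = 16, 0.1                             # arbitrary sample count and spacing
t = np.arange(N) * dt                       # sample times 0, dt, ..., (N-1) dt
rng = np.random.default_rng(0)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # arbitrary complex signal

# Basis functions f_k(t) = exp(2 pi i k t / (N dt)), one per column.
k = np.arange(N)
F = np.exp(2j * np.pi * np.outer(t, k) / (N * dt))

# Orthogonality: <f_j, f_k> equals N if j == k and 0 otherwise.
gram = F.conj().T @ F

# Coefficients by projection onto each basis function: c_k = <f_k, f> / N.
c = F.conj().T @ f / N

# The same coefficients from the FFT, rescaled by 1/N.
c_fft = np.fft.fft(f) / N

# Reconstruct f as the superposition sum_k c_k f_k.
f_rec = F @ c
```

The explicit matrix product costs $O(N^2)$ operations, which is why the FFT is the practical choice for large $N$.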

I will, however, draw your attention to the fact that if we start with information about $f$ at times $\{0,\Delta t,\ldots, (N-1)\Delta t\}$, then we end up with frequency information at the frequencies $\{0,\frac{2\pi}{N\Delta t},\ldots,\frac{2\pi(N-1)}{N\Delta t}\}$. Also, you should really think of the frequencies as wrapping around cyclically (frequencies that differ from each other by a multiple of $\frac{2\pi}{\Delta t}$ are indistinguishable on the interval we sampled over), and also if $f$ is real-valued then $\hat{f}(-\omega) = \overline{\hat{f}(\omega)}$, where the bar means complex conjugate and $-\omega$ is, as just noted, the same as $\frac{2\pi}{\Delta t}-\omega$.
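The conjugate symmetry for real-valued signals is easy to confirm numerically. In this sketch (with an arbitrary random real signal), frequency index $-k$ is taken to mean index $N-k$, following the cyclic wrap-around described above.

```python
import numpy as np

N = 32
rng = np.random.default_rng(1)
f = rng.standard_normal(N)                 # arbitrary real-valued signal
f_hat = np.fft.fft(f) / N                  # 1/N normalization, as in this post

# Frequency index -k wraps around to (N - k) mod N.
k = np.arange(N)
f_hat_neg = f_hat[(-k) % N]

# For real f, f_hat(-omega) is the complex conjugate of f_hat(omega).
print(np.allclose(f_hat_neg, np.conj(f_hat)))
```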

A final note before continuing is that we could have decomposed $f$ into a set of almost any $N$ frequencies (as long as they were still linearly independent), although we can’t necessarily do so in $O(N\log N)$ time. We will focus on the set of frequencies obtained by a Fourier transform for now.

When Fourier Analysis Fails

The goal of taking a Fourier transform is generally to decompose a signal into component frequencies, under the assumption that the signal itself was generated by some “true” superposition of frequencies. This “true” superposition would best be defined as the frequency spectrum we would get if we had an infinitely long continuous tape of noise-free measurements and then took the continuous Fourier transform.

I’ve already indicated one case in which Fourier analysis can fail, and this is given by the fact that the Fourier transform can’t distinguish between frequencies that are separated from each other by multiples of $\frac{2\pi}{\Delta t}$. In fact, what happens in general is that you run into problems when your signal contains frequencies that move faster than your sampling rate. The rule of thumb is that your signal should contain no significant frequency content above the Nyquist frequency, which is half the sampling frequency. One way to think of this is that the “larger” half of our frequencies (i.e. $\frac{\pi}{\Delta t}$ up through $\frac{2\pi}{\Delta t}$) are really just the negatives of the smaller half of our frequencies, and so we can measure frequencies up to roughly $\frac{\pi}{\Delta t}$ before different frequencies start to run into each other.

The general phenomenon that goes on here is known as aliasing, and is the same sort of effect as what happens when you spin a bicycle wheel really fast and it appears to be moving backwards instead of forwards. The issue is that your eye only samples at a given rate and so rotations at speeds faster than that appear the same to you as backwards motion. See also this image from Wikipedia and the section in the aliasing article about sampling sinusoidal functions.

The take-away message here is that you need to sample fast enough to capture all of the actual motion in your data, and the way you solve aliasing issues is by increasing the sample rate.
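A short NumPy sketch of aliasing, with an assumed sampling interval and assumed frequencies: two sinusoids whose angular frequencies differ by $\frac{2\pi}{\Delta t}$ produce the same samples, and a frequency just above $\frac{\pi}{\Delta t}$ produces the same samples as a negative frequency just below it in magnitude.

```python
import numpy as np

dt = 0.05                                  # sampling interval (hypothetical value)
N = 40
t = np.arange(N) * dt

omega = 3.0                                # well below pi/dt (about 62.8)
omega_alias = omega + 2 * np.pi / dt       # shifted by one full wrap-around

low = np.cos(omega * t)
high = np.cos(omega_alias * t)             # oscillates far faster between samples
max_diff = np.max(np.abs(low - high))      # but agrees at every sample point

# A frequency slightly above pi/dt matches a negative frequency below it.
omega_fast = np.pi / dt + 5.0
omega_mirror = omega_fast - 2 * np.pi / dt   # equals -(pi/dt - 5.0)
fast = np.sin(omega_fast * t)
mirrored = np.sin(omega_mirror * t)
```

Halving `dt` moves $\frac{\pi}{\Delta t}$ above `omega_fast`, at which point the two sampled signals in the second pair no longer coincide.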

A trickier problem is the “windowing” problem, also known as spectral leakage. [Note: I really recommend reading the linked Wikipedia article at some point, as it is a very well-written and insightful treatment of this issue.] The problem can be summarized intuitively as follows: nearby frequencies will “bleed into” each other, and the easiest way to reduce this phenomenon is to increase your sample time. Another intuitive statement to this effect is that the extent to which you can distinguish between two nearby frequencies is roughly proportional to the number of full periods that you observe of their difference frequency. I will make both of these statements precise below. First, though, let me convince you that spectral leakage is relevant by showing you what the Fourier transform of a periodic signal looks like when the period doesn’t fit into the sampling window. The first image below is a plot of $y = \cos(t)$, and the second is a snapshot of part of the Fourier transform (blue is real part, green is imaginary part). Note that the plot linearly interpolates between sample points. Also note that the sampling frequency was 100 Hz, although that is almost completely irrelevant.

The actual frequency content should be a single spike at $\omega = 1$, so windowing can in fact cause non-trivial issues with your data.

Now let’s get down to the actual analytical reason for the windowing / spectral leakage issue. Recall the formula for the Fourier transform: $\hat{f}(\omega) = \frac{1}{N} \sum_{t} f(t)e^{-i\omega t}$. Now suppose that $f$ is a complex exponential with some frequency $\omega'$, i.e. $f(t) = e^{i\omega' t}$. Then some algebra (summing a geometric series) will yield the formula

$\hat{f}(\omega) = \frac{1}{N} \frac{e^{i(\omega'-\omega)N\Delta t}-1}{e^{i(\omega'-\omega)\Delta t}-1}$,

which tells us the extent to which a signal at a frequency of $\omega'$ will incorrectly contribute to the estimate of the frequency content at $\omega$. The main thing to note here is that larger values of $N$ will cause this function to become more concentrated horizontally, which means that, in general (although not necessarily at a given point), it will become smaller. At the same time, if you change the sampling rate without changing the total sampling time then you won’t significantly affect the function. This means that the easiest way to decrease windowing is to increase the amount of time that you sample your signal, but that sampling more often will not help you at all.
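The leakage formula can be checked against the direct sum, and the two claims about sampling longer versus sampling faster can be tried out numerically. This is only a sketch: the frequencies, $\Delta t$, and the sample counts below are assumed values, and at any single pair of frequencies the leakage can move either way even though its envelope shrinks with total time.

```python
import numpy as np

def leakage_direct(omega, omega_p, N, dt):
    """(1/N) * sum over sample times of exp(i omega_p t) * exp(-i omega t)."""
    t = np.arange(N) * dt
    return np.sum(np.exp(1j * omega_p * t) * np.exp(-1j * omega * t)) / N

def leakage_closed_form(omega, omega_p, N, dt):
    """Geometric-series closed form of the same sum."""
    d = omega_p - omega
    return (np.exp(1j * d * N * dt) - 1) / (np.exp(1j * d * dt) - 1) / N

dt = 0.01
omega, omega_p = 1.0, 1.7                  # assumed probe and signal frequencies

# Direct sum and closed form over a 5 second window.
direct = leakage_direct(omega, omega_p, 500, dt)
closed = leakage_closed_form(omega, omega_p, 500, dt)

# Same 5 second window, ten times the sampling rate: magnitude barely changes.
finer = leakage_closed_form(omega, omega_p, 5000, dt / 10)

# Ten times the window at the original sampling rate: magnitude drops.
longer = leakage_closed_form(omega, omega_p, 5000, dt)

print(abs(direct), abs(finer), abs(longer))
```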

Another point is that spectral leakage is generically roughly proportional to the inverse of the distance between the two frequencies (although it goes to zero when the difference in frequencies is close to a multiple of $\frac{2\pi}{N\Delta t}$), which quantifies the earlier statement about the extent to which two frequencies can be separated from each other.

Some other issues to keep in mind: the Fourier transform won’t do a good job with quasi-periodic data (data that is roughly periodic with a slowly-moving phase shift), and there is also no guarantee that your data will have good structure in the frequency domain. It just happens that this is in theory the case for analytic systems with a periodic excitation (see note (1) in the last section of this post – “Answers to Selected Exercises” – for a more detailed explanation).

When Fourier Analysis Succeeds

Despite issues with aliasing and spectral leakage, there are some strong points to the Fourier transform. The first is that, since the Fourier transform is an orthogonal map, it does not amplify noise. More precisely, $\|\hat{f}-\hat{f_0}\|_2 = \frac{1}{\sqrt{N}}\|f-f_0\|_2$, so two signals that are close together have Fourier transforms that are also close together. This may be somewhat surprising since normally when one fits $N$ parameters to a signal of length $N$, there are significant issues with overfitting that can cause noise to be amplified substantially.
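The norm identity above can be verified directly. The sketch below uses an arbitrary cosine as the "true" signal and adds Gaussian noise (both are assumptions), then compares the two sides of the identity under the $\frac{1}{N}$ normalization used in this post.

```python
import numpy as np

N = 256
rng = np.random.default_rng(2)
n = np.arange(N)

f0 = np.cos(2 * np.pi * 5 * n / N)         # assumed "true" signal
f = f0 + 0.1 * rng.standard_normal(N)      # measured signal with additive noise

f_hat = np.fft.fft(f) / N
f0_hat = np.fft.fft(f0) / N

lhs = np.linalg.norm(f_hat - f0_hat)       # distance in the frequency domain
rhs = np.linalg.norm(f - f0) / np.sqrt(N)  # scaled distance in the time domain
print(lhs, rhs)
```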

However, while the Fourier transform does not amplify noise, it can concentrate noise. In particular, if the noise has some sort of quasi-periodic structure then it will be concentrated over a fairly small range of frequencies.

Note, though, that the L2 norm of the noise in the frequency domain will be roughly constant relative to the number of samples. This is because, if $f_0$ is the “true” signal and $f$ is the measured signal, then $\|f-f_0\|_2 = \Theta(\sqrt{N})$, so that $\|\hat{f}-\hat{f_0}\|_2 = \Theta(1)$. Now also note that the number of frequency measurements we get out of the Fourier transform within a fixed band is proportional to the sampling time, that is, it is $\Theta(N\Delta t)$. If we put these assumptions together, and also assume that the noise is quasi-periodic such that it will be concentrated over a fixed set of frequencies, then we get $\Theta(1)$ noise distributed in the L2 sense over $\Theta(N\Delta t)$ frequencies, which implies that the level of noise at a given frequency should be $\Theta(\frac{1}{\sqrt{N\Delta t}})$. In other words, sampling for a longer time will increase our resolution on frequency measurements, which means that the noise at a given frequency will decrease as the square-root of the sampling time, which is nice.

My second point is merely that there is no spectral leakage between frequencies that differ by multiples of $\frac{2\pi}{N\Delta t}$, so in the special case when all significant frequency content of the signal occurs at frequencies that are multiples of $\frac{2\pi}{N\Delta t}$ and that are less than $\frac{\pi}{\Delta t},$ all problems with windowing and aliasing go away and we do actually get a perfect measure of the frequency content of the original signal.

Least Squares as a Substitute

The Fourier transform gives us information about the frequency content at $0, \frac{2\pi}{N\Delta t}, \frac{4\pi}{N\Delta t}, \ldots, \frac{2(N-1)\pi}{N\Delta t}$. However, this set of frequencies is somewhat arbitrary and might not match up well to the “important” frequencies in the data. If we have extra information about the specific set of frequencies we should be caring about, then a good substitute for Fourier analysis is to do least squares fitting to the signal as a superposition of the frequencies you care about.

In the special case that the frequencies you care about are a subset of the frequencies provided by the Fourier transform, you will get identical results (this has to do with the fact that complex exponentials at these frequencies are all orthogonal to each other).

In the special case that you exactly identify which frequencies occur in the signal, you eliminate the spectral leakage problem entirely (it still occurs in theory, but not between any of the frequencies that actually occur). A good way to do this in the case of a dynamical system is to excite the system at a fixed frequency so that you know to look for that frequency plus small harmonics of that frequency in the output.
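Here is a minimal sketch of that special case, under assumed values: a signal at a single known frequency $\omega_0 = 1$ observed over a window that does not contain a whole number of periods. Fitting a cosine and a sine at $\omega_0$ by least squares recovers the amplitudes, while the discrete Fourier transform of the same data spreads the content over several bins.

```python
import numpy as np

dt, N = 0.01, 500                          # 5 second window; the period 2*pi does not fit
t = np.arange(N) * dt
omega0 = 1.0                               # frequency assumed to be known in advance
y = 2.0 * np.cos(omega0 * t) + 0.5 * np.sin(omega0 * t)

# Least squares fit to a superposition of cos and sin at the known frequency.
A = np.column_stack([np.cos(omega0 * t), np.sin(omega0 * t)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Discrete Fourier transform of the same samples, for comparison.
y_hat = np.fft.fft(y) / N
mags = np.abs(y_hat)

# Count bins holding more than 1% of the peak magnitude; a leak-free
# real sinusoid would occupy only two bins (positive and negative frequency).
leaked_bins = int(np.sum(mags > 0.01 * mags.max()))
print(coef, leaked_bins)
```

The example is noise-free, so the recovery is exact up to rounding; with noise the fit would still be unbiased but no longer exact.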

In typical cases least squares will be fairly resistant to noise unless that noise has non-trivial spectral content at frequencies near those being fit. This is almost tautologically true, as it just says that spectral leakage is small between frequencies that aren’t close together. However, this isn’t exactly true, as fitting non-orthogonal frequencies changes the sort of spectral leakage that you get, and picking a “bad” set of frequencies (usually meaning large condition number) can cause lots of spectral leakage even between far apart frequencies, or else drastically exacerbate the effects of noise.

This leads to one reason not to use least squares and to use the Fourier transform instead (other than the fact that the Fourier transform is more efficient in an algorithmic sense at getting data about large sets of frequencies – $\Theta(N\log(N))$ instead of $\Theta(N^2)$). The Fourier transform always has a condition number of $1$, whereas least squares will in general have a condition number greater than $1$, and poor choices of frequencies can lead to very large condition numbers. I typically run into this problem when I attempt to gain lots of resolution on a fixed range of frequencies.
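The conditioning issue shows up even in a tiny example. The sketch below (all sizes and spacings are assumptions) compares the condition number of a least squares design matrix built from five Fourier-transform frequencies with one built from five frequencies packed ten times more closely.

```python
import numpy as np

def exp_design(omegas, N, dt):
    """Design matrix whose columns are exp(i * omega * t) at the sample times."""
    t = np.arange(N) * dt
    return np.exp(1j * np.outer(t, omegas))

N, dt = 200, 0.05                          # 10 second window (assumed)
base = 2 * np.pi / (N * dt)                # spacing of the Fourier-transform frequencies

# Five frequencies on the Fourier grid: the columns are orthogonal.
A_dft = exp_design(base * np.arange(1, 6), N, dt)

# Five frequencies spaced at one tenth of the grid spacing: nearly collinear columns.
A_fine = exp_design(base * (1 + 0.1 * np.arange(5)), N, dt)

cond_dft = np.linalg.cond(A_dft)
cond_fine = np.linalg.cond(A_fine)
print(cond_dft, cond_fine)
```

A large condition number means that small perturbations of the data can produce large changes in the fitted amplitudes.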

This makes sense, because there are information-theoretic limits on the amount of frequency data I can get out of a given amount of time-domain data, and if I could zoom in on a given frequency individually, then I could just do that for all frequencies one-by-one and break the information theory bounds. To beat these bounds you will have to at least implicitly make additional assumptions about the structure of the data. However, I think you can probably get pretty good results without making too strong of assumptions, but I unfortunately don’t personally know how to do that yet.

So to summarize, the Fourier transform is nice because it is orthogonal and can be computed quickly. Least squares is nice because it allows you to pick which frequencies you want and so gives you a way to encode additional information you might have about the structure of the signal.

Some interesting questions to ask:

What does spectral leakage look like for non-orthogonal sets of frequencies? What do the “bad” cases look like?
What is a good set of assumptions to make that helps us get better frequency information? (The weaker the assumption and the more leverage you get out of it, the better it is.)
Perhaps we could try something like: “pick the smallest set of frequencies that gives us a good fit to the data”. How could we actually implement this in practice, and would it have any shortcomings? How good would it be at pulling weak signals out of noisy data?
What in general is a good strategy for pulling a weak signal out of noisy data?
What is a good way of dealing with quasi-periodic noise?
Is there a way to deal with windowing issues, perhaps by making statistical assumptions about the data that allows us to “sample” from possible hypothetical continuations of the signal to later points in time?

Take-away lessons

To summarize, I would say the following:

Least squares

good when you get to sample from a distribution of inputs that matches the actual distribution that you’re going to deal with in practice
bad due to systematic biases when noise is correlated with signal (usually occurs with “output noise” in the case of a dynamical system)

Fourier transform

good for getting a large set of frequency data
good because of small condition number
can fail due to aliasing
also can be bad due to spectral leakage, which can be dealt with by using least squares if you have good information about which frequencies are important

Answers to selected exercises

Okay well mainly I just feel like some of the questions that I gave as exercises are important enough that you should know the answer. There isn’t necessarily a single answer, but I’ll at least give you a good way of doing something if I know of one. It turns out that for this post I only have one good answer, which is about dealing with non-linear dynamical systems.

We can figure out if a dynamical system is non-linear (and get some quantitative data about the non-linearities we’re dealing with) by inputting a signal that has only a few frequencies (i.e., the superposition of a small number of sines and cosines) and then looking at the Fourier transform of the response. If the system is completely linear, then the response should contain the same set of frequencies as the input (plus a bit of noise). If the system is non-linear but still analytic then you will also see responses at integer linear combinations of the input frequencies. If the system is non-analytic (for example due to Coulombic friction, the type of friction you usually assume in introductory physics classes) then you might see a weirder frequency response.
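As a toy version of this test, the sketch below drives two memoryless systems with a two-frequency input and lists which Fourier bins respond. The systems (a pure gain, and a gain plus a small quadratic term) are stand-ins chosen for simplicity rather than real dynamical systems, and the input frequencies sit exactly on Fourier bins so that leakage does not confuse the picture.

```python
import numpy as np

N = 64
n = np.arange(N)
k1, k2 = 3, 5                              # input frequencies, as Fourier bin indices
u = np.cos(2 * np.pi * k1 * n / N) + np.cos(2 * np.pi * k2 * n / N)

def active_bins(y, tol=1e-8):
    """Non-negative frequency bins (0 through N/2) with non-negligible content."""
    mags = np.abs(np.fft.fft(y) / len(y))
    return [k for k in range(len(y) // 2 + 1) if mags[k] > tol]

# Linear system (assumed gain 0.7): only the input frequencies respond.
linear_bins = active_bins(0.7 * u)

# Analytic non-linear system (assumed quadratic term): extra responses at
# integer combinations 0, k2 - k1, 2*k1, k1 + k2, and 2*k2.
nonlinear_bins = active_bins(u + 0.2 * u**2)

print(linear_bins, nonlinear_bins)
```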

Linear Control Theory: Part I

2010-07-17T00:00:00-07:00

Last time I talked about linear control, I presented a Linear Quadratic Regulator as a general purpose hammer for solving linear control problems. In this post I’m going to explain why LQR by itself is not enough (even for nominally linear systems). (Author’s note: I got to the end of the post and realized I didn’t fulfill my promise in the previous sentence. So it’s redacted, but will hopefully be dealt with in a later post.) Then I’m going to do my best to introduce a lot of the standard ideas in linear control theory.

My motivation for this is that, even though these ideas have a reasonably nice theory from a mathematical standpoint, they are generally presented from an engineering standpoint. And although all of the math is right there, and I’m sure that professional control theorists understand it much better than I do, I found that I had to go to a lot of effort to synthesize a good mathematical explanation of the underlying theory.

However, this effort was not due to any inherent difficulties in the theory itself, but rather, like I said, a disconnect in the intuition of, and issues relevant to, an engineer versus a mathematician. I’m not going to claim that one way of thinking is better than the other, but my way of thinking, and I assume that of most of my audience, falls more in line with the mathematical viewpoint. What’s even better is that many of the techniques built up for control theory have interesting ramifications when considered as statements about vector spaces. I hope that you’ll find the exposition illuminating.

As before, we will consider a linear system

$\dot{x} = Ax+Bu,$

where $A$ and $B$ are matrices and $u$ is a vector of control inputs ($x$ is the state of the system). However, in addition to a control input $u$, we will have an output $y$, such that $y$ is a function of $x$ and $u$:

$y = Cx+Du.$

In some cases, $y$ will be a set of observed states of a system, but in principle $y$ can be any quantity we care about, provided that it is a linear function of state and control. We further assume that $A$, $B$, $C$, and $D$ are constant with respect to time. We call a system that follows this assumption a linear time-invariant system, or just LTI system.

Since the system is linear, we have superposition and therefore can break up any function (for example, the function from $u(t)$ to $y(t)$) into a function from each coordinate of $u(t)$ to each coordinate of $y(t)$. For each of these functions, we can take their Laplace transform. So, we start with

$\dot{x} = Ax+Bu$

$y = Cx+Du$

and end up with (after taking the Laplace transform)

$sX = AX+BU$

$Y = CX+DU.$

Solving these two equations for $Y$ as a function of $U$ gives $Y = (C(sI-A)^{-1}B+D)U$. We call this mapping from $U$ to $Y$ the transfer function of the system. Cramer’s Rule implies that the transfer function of any linear time-invariant system will be a matrix where each entry is a ratio of two polynomials. We refer to such transfer functions as rational. I will show later that the converse is also true: any rational matrix is the transfer function of some LTI system. We call such an LTI system the state-space representation of the transfer function. (I apologize for throwing all this terminology at you, but it is used pretty unapologetically in control systems literature so I’d feel bad leaving it out.)

As an example, consider a damped harmonic oscillator with an external force $u$ as a control input, and suppose that the outputs we care about are position and velocity. We will let $q$ denote the position of the oscillator. This has the following state-space representation:

$\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[ \begin{array}{cc} 0 & 1 \\ -k & -b \end{array} \right] \left[ \begin{array}{c} q \\ \dot{q} \end{array} \right] + \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] u$

$\left[ \begin{array}{c} y_1 \\ y_2 \end{array} \right] = \left[ \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right] \left[ \begin{array}{c} q \\ \dot{q} \end{array} \right] + 0 \cdot u$

Here $k$ is the spring constant of the oscillator and $b$ is the damping factor. For convenience we will write $x$ instead of $\left[ \begin{array}{c} q \\ \dot{q} \end{array} \right]$ and $y$ instead of $\left[ \begin{array}{c} y_1 \\ y_2 \end{array} \right]$. Also, we will let $I$ denote the $2 \times 2$ identity matrix. Then, after taking the Laplace transform, we get

$sX = \left[ \begin{array}{cc} 0 & 1 \\ -k & -b \end{array} \right]X + \left[ \begin{array}{c} 0 \\ 1 \end{array} \right]U$

$Y = X.$

Solving the first equation gives

$\left[ \begin{array}{cc} s & -1 \\ k & s+b \end{array} \right] X = \left[ \begin{array}{c} 0 \\ 1 \end{array} \right]U,$

$X = \frac{1}{s^2+bs+k}\left[ \begin{array}{cc} s+b & 1 \\ -k & s \end{array} \right]\left[ \begin{array}{c} 0 \\ 1 \end{array}\right]U = \frac{1}{s^2+bs+k} \left[ \begin{array}{c} 1 \\ s \end{array} \right]U$

Therefore, the transfer function from $U$ to $Y$ is $\frac{1}{s^2+bs+k} \left[ \begin{array}{c} 1 \\ s \end{array} \right]$.
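The closed form can be sanity-checked by evaluating $C(sI-A)^{-1}B+D$ numerically at a test point. The values of $k$, $b$, and $s$ below are arbitrary assumptions; any $s$ that is not a pole would do.

```python
import numpy as np

k, b = 4.0, 0.5                            # hypothetical spring constant and damping
A = np.array([[0.0, 1.0], [-k, -b]])
B = np.array([[0.0], [1.0]])
C = np.eye(2)
D = np.zeros((2, 1))

def transfer(s, A, B, C, D):
    """Evaluate C (sI - A)^{-1} B + D at a complex frequency s."""
    n = A.shape[0]
    return C @ np.linalg.solve(s * np.eye(n) - A, B) + D

s = 0.3 + 2.0j                             # arbitrary test point, not a pole
G_numeric = transfer(s, A, B, C, D)

# Closed form derived above: [1, s]^T / (s^2 + b s + k).
G_closed = np.array([[1.0], [s]]) / (s**2 + b * s + k)
print(np.allclose(G_numeric, G_closed))
```

Using `np.linalg.solve` avoids forming the matrix inverse explicitly, which is the usual numerical practice.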

We can think of the transfer function as a multiplier on the frequency spectrum of $u$. Note that $s$ is allowed to be an arbitrary complex number: if $s$ is non-real then we have oscillation at a frequency equal to the imaginary part of $s$; if $\Re(s) < 0$ then we have damped oscillation, whereas if $\Re(s) > 0$ then the magnitude of the oscillation increases exponentially. Here $\Re(s)$ denotes the real part of $s$.

Exercise: What does a pole of a transfer function correspond to? What about a zero? Answers below the fold.

If a transfer function has a pole, then it means that even if a given frequency doesn’t show up in the input $u$, it can still show up in the output $y$. Thus it is some self-sustaining, natural mode of the system. For LTI systems, this corresponds to an eigenvector of the matrix $A$, and the location of the pole is the corresponding eigenvalue.

A zero, on the other hand, means that a mode will not show up in the output even if it is present in the input. For instance, the damped oscillator has poles at $\frac{-b \pm \sqrt{b^2-4k}}{2}$. Let us assume that $b$ and $k$ are both positive for the damped oscillator. Then, for $b \geq 2\sqrt{k}$, both of the poles are real and negative, meaning that the system is overdamped (or critically damped when $b = 2\sqrt{k}$ exactly). For $b < 2\sqrt{k}$, the poles have negative real part and imaginary part equal to $\pm\sqrt{k-\frac{b^2}{4}}$, meaning that the system will exhibit damped oscillation. Finally, there is a zero in the second coordinate of the transfer matrix at $s = 0$. This corresponds to the fact that a harmonic oscillator can be held at a fixed distance from its natural fixed point by a fixed external force. Since the distance is fixed, the contribution to velocity is zero.
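Since the poles are the eigenvalues of $A$, they can be computed numerically and compared with the formula above. The parameter values in this sketch are assumptions, picked to give one oscillatory case and one case with two real poles.

```python
import numpy as np

def oscillator_poles(k, b):
    """Eigenvalues of the damped oscillator's dynamics matrix A."""
    A = np.array([[0.0, 1.0], [-k, -b]])
    return np.linalg.eigvals(A)

# b < 2 sqrt(k): complex-conjugate pair, real part -b/2,
# imaginary part +/- sqrt(k - b^2 / 4).
under = oscillator_poles(k=4.0, b=1.0)

# b > 2 sqrt(k): two real negative poles, here the roots of s^2 + 5 s + 4.
over = oscillator_poles(k=4.0, b=5.0)

print(under, over)
```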

There is more to be said on transfer functions, but before I go into that I would like to give you a working picture of how $u$ and $y$ should be viewed mathematically. This is a view that I only recently acquired. For this I owe thanks to Stefano Stramigioli, who gave a very interesting talk on port-Hamiltonian methods at Dynamic Walking 2010. (Update: Stefano recommends this book as a resource for learning more.)

Duality

Here is how I think you should think about linear control mathematically. First, you have a state-space $V$. You also have a space of controls $U$ and a space of outputs $Y$. Finally, you have a space $TV$, the tangent space to $V$.

Ignoring $U$ and $Y$ for a moment, let’s just focus on $V$ and $TV$. We can think of elements of $TV$ as generalized forces, and the elements of $V$ as generalized velocities. I realize that state-space also takes position into account, but you will note that no external forces show up in the equations for position, so I think this view still makes sense.

If we have a set of forces and velocities, then we can compute power (if our system is in regular Cartesian coordinates, then this is just $\vec{F} \cdot \vec{v}$). In this way, we can think of $V$ and $TV$ as dual to each other. I think that generalized velocities are actually somehow supposed to live in the cotangent space $T^*V$, rather than $V$, but I don’t know enough analysis to see why this is true. If someone else does, I would love to hear your explanation.

At any rate, we have these two spaces, $V$ and $TV$, that are in duality with each other. The operator $A : V \to TV$ then induces a map $\tilde{A}$ from $\mathcal{L}^{1}(\mathbb{R},TV)$ to $\mathcal{L}^{1}(\mathbb{R},V)$, where $\mathcal{L}^{1}(X,Y)$ is the space of Lebesgue-integrable functions from $X$ to $Y$ (although in practice all of our inputs and outputs will be real-valued, not complex-valued, since the systems we care about are all physical). Since $V$ and $TV$ are in duality with each other, we can also think of this as assigning a power history to any force history (the power history being $[\tilde{A}(f)](f)$, where $f$ is the force history).

What’s more remarkable is that the transfer function from force histories to state histories is $(sI-A)^{-1}$ in the Laplace domain (as discussed above – just set $B = C = I$ for the state-space representation). Therefore it is invertible except on a set of measure zero (the eigenvalues of $A$, which are the poles of the transfer function) and so as far as $\mathcal{L}^{1}$ spaces are concerned it is an isomorphism; this is a bit of a technical point here, but I’m using the fact that $\mathcal{L}^1$ spaces are composed of equivalence classes of functions that differ on sets of measure zero, and also probably implicitly using some theorems from Fourier analysis about how the Fourier (Laplace) transform is an isomorphism from $\mathcal{L}^{1}(\mathbb{R},V)$ to itself. I’m still glossing over some technical details here; in particular, I think you might need to consider the intersection of $\mathcal{L}^1$ and $\mathcal{L}^2$ instead of just $\mathcal{L}^1$, and also the target space of the Fourier transform is really $\mathcal{L}^{1}(\widehat{\mathbb{R}},V)$, not $\mathcal{L}^1(\mathbb{R},V)$, but these details aren’t really important to the exposition.

Getting back on track, we’ve just shown that the dynamics matrix $A$ of a linear system induces an isomorphism between force histories and state histories. My guess is that you can also show this for reasonably nice non-linear systems, but I don’t have a proof off the top of my head. So, letting $U$ denote the space of control signals and $Y$ the space of outputs, what we have is something like this:

$\mathcal{L}^{1}(\mathbb{R},U) \xrightarrow{B} \mathcal{L}^{1}(\mathbb{R},TV) \xrightarrow{\overset{\tilde{A}}{\sim}} \mathcal{L}^{1}(\mathbb{R},V) \xrightarrow{C} \mathcal{L}^{1}(\mathbb{R},Y)$

Incidentally, that middle map (the isomorphism with $\tilde{A}$) is hideous-looking, and if someone has a good way to typeset such a thing I would like to know about it.

In any case, in this context it is pretty easy to see how the inputs and outputs play dual roles to each other, and in fact if we replaced $A$, $B$, and $C$ each with their adjoints $A^{\dagger}$, $B^{\dagger}$, and $C^{\dagger}$, then we get a new dynamical system where the inputs and outputs actually switch places (as well as the matrices governing the inputs and outputs). Note that I’ve left $D$ out of this for now. I’m not really sure yet of a good way to fit it into this picture; it’s possible that $D$ is just unnatural mathematically but sometimes necessary physically (although usually we can assume that $D = 0$).
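One concrete way to see the inputs and outputs trading places is at the level of transfer functions: for real matrices with $D = 0$, the system built from the adjoints has a transfer function that is the transpose of the original. The sketch below checks this on randomly generated matrices, where the dimensions and the test point $s$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 4, 2, 3                          # numbers of states, inputs, outputs (assumed)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))

def transfer(s, A, B, C):
    """Evaluate C (sI - A)^{-1} B, the transfer function with D = 0."""
    return C @ np.linalg.solve(s * np.eye(A.shape[0]) - A, B)

s = 0.5 + 1.5j                             # arbitrary test point

G = transfer(s, A, B, C)                   # p x m: maps inputs to outputs

# Dual system: A, B, C replaced by their adjoints (transposes for real
# matrices), with the roles of the input and output matrices swapped.
G_dual = transfer(s, A.T, C.T, B.T)        # m x p

print(np.allclose(G_dual, G.T))
```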

Now that we have this nice framework for thinking about linear control systems, I’m going to introduce controllers and observers, and it will be easy to see that they are dual to each other in the sense just described.

Controllability and Observability

Go back to the non-linear case for a moment and suppose that we have a system $\dot{x} = f(x,u)$, or, in the notation I’ve been using, $\dot{x} = f(x) + Bu$. We say that such a system is controllable if for any two states $x_1$ and $x_2$, there exists a time $t_0 > 0$ and a control signal $u(t)$ such that if $x(0) = x_1$ then $x(t_0) = x_2$ when the system is driven by the control signal $u(t)$. What this says intuitively is that we can get from any state to any other state in a finite amount of time.

For linear systems, controllability implies something stronger: we can actually get from any state to any other state arbitrarily quickly, and this is oftentimes the definition given in the linear case. For non-linear systems, this is not the case. As a trivial example, we could have

$\dot{x_1} = u$

$\dot{x_2} = \max(x_1,1)$

There are a few important properties of linear systems that are equivalent to controllability:

(1) There is no proper subspace $W$ of the state space such that $A(W) \subset W$ and $B(U) \subset W$, where $U$ is the space of possible instantaneous control signals. The intuition is that there is no subspace that the passive dynamics (without control) can get stuck in such that the control input can’t move the dynamics out of that space.

(2) There is no left eigenvector of $A$ that is in the left null space of $B$. In other words, it actually suffices to check the criterion (1) above just for one-dimensional subspaces.

(3) The matrix $[B \ AB \ A^2B \ \ldots \ A^{n-1}B]$, where $n$ is the dimension of the state space of the system, has full row rank.

(4) For any choice of $n$ eigenvalues $\lambda_1, \ldots, \lambda_n$, there exists a matrix $F$ such that $A+BF$ has generalized eigenvalues $\lambda_1, \ldots, \lambda_n$. We can think of this as saying that an appropriate linear feedback law $u = Fx$ can be used to give the closed-loop (i.e. after control is applied) dynamics arbitrary eigenvalues.

I will leave (1) and (2) to you as exercises. Note that this is because I actually think you can solve them, not because I’m being lazy. (3) I will prove shortly (it is a very useful computational criterion for testing controllability). (4) I will prove later in this post. I should also note that these criteria also hold for a discrete-time system

$x_{n+1} = Ax_n + Bu_n$

$y_n = Cx_n + Du_n$

Proof of (3): In the case of a discrete-time system, if we have control inputs $u_1, \ldots, u_k$, then $x_{k+1}$ will be

$A^k x_1 + (Bu_k + ABu_{k-1} + A^2Bu_{k-2} + \ldots + A^{k-1}Bu_1)$

In particular, after $k$ time steps we can affect $x_{k+1}$ by an arbitrary linear combination of elements from the column spaces of $A^{i}B$, where $i$ ranges from $0$ to $k-1$. In other words, we can drive $x_{k+1}$ to an arbitrary state if and only if the column space of $[A^{i}B]_{i=0}^{k-1}$ is the entire state space, i.e. $[A^{i}B]_{i=0}^{k-1}$ has full row rank. So a discrete-time system is controllable if and only if $[A^{i}B]_{i=0}^{k-1}$ has full row rank for some sufficiently large $k$.

To finish the discrete-time case, we use the Cayley-Hamilton theorem, which shows that any $n \times n$ matrix satisfies a degree $n$ polynomial, and so in particular it suffices to pick $k = n$ above, since $A^nB$ can be written as a linear combination of $A^{i}B$ for $i < n$, and similarly for any larger powers of $A$.
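The discrete-time argument is easy to try out numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the double-integrator example are made up for illustration. It builds $[B \ AB \ \ldots \ A^{n-1}B]$, checks its rank, and then solves for inputs $u_1, \ldots, u_n$ that reach a target state in $n$ steps.

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, ..., A^(n-1) B] side by side."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def is_controllable(A, B):
    """Criterion (3): the controllability matrix has full row rank."""
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

def steer(A, B, x_start, x_target):
    """Inputs u_1..u_n that take x_1 = x_start to x_{n+1} = x_target."""
    n, m = B.shape
    ctrb = controllability_matrix(A, B)
    # x_{n+1} = A^n x_1 + [B, AB, ..., A^(n-1)B] [u_n; u_{n-1}; ...; u_1]
    residual = x_target - np.linalg.matrix_power(A, n) @ x_start
    stacked, *_ = np.linalg.lstsq(ctrb, residual, rcond=None)
    # stacked holds u_n first, so reverse the rows to get u_1, ..., u_n
    return stacked.reshape(n, m)[::-1]

# Hypothetical example: a discretized double integrator (position, velocity)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
x = np.array([0.0, 0.0])
target = np.array([1.0, 0.0])
for u in steer(A, B, x, target):
    x = A @ x + B @ u   # after the loop, x should equal target
```

If the pair is not controllable, the least squares solve still returns something, but the final state will generally miss the target, so it is worth calling `is_controllable` first.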

Now we need to deal with the continuous time case. In this case, we can use the theory of linear differential equations to show that

$x(t) = e^{At}x(0) + \int_{0}^{t} e^{A\tau}Bu(t-\tau) d\tau,$

where $e^{A\tau}$ is the matrix exponential of $A\tau$. But if we use the Cayley-Hamilton theorem a second time, we see that $e^{A\tau}$ can be expressed as an $(n-1)$st degree polynomial in $A\tau$, so that there exists some $c_0(\tau), \ldots, c_{n-1}(\tau)$ such that

$x(t) =e^{At}x(0) + \sum_{k=0}^{n-1} A^kB \int_{0}^{t} c_k(\tau)u(t-\tau) d\tau.$

From here it is clear that, in order for a continuous time system to be controllable, the controllability matrix must have full row rank (since $x(t)$ is equal to $e^{At}x(0)$ plus something in the column space of the controllability matrix). The converse is less obvious. If the $c_k(\tau)$ were linearly independent functions, then we would be done, because the last term in the sum can be thought of as the inner product of $c_k(\tau)$ and $u(t-\tau)$, and we can just use Gram-Schmidt orthogonalization to show that those inner products can be chosen arbitrarily (if you don’t see this then figuring it out is a good linear algebra exercise).

The problem is that the $c_k(\tau)$ are not necessarily linearly independent. If $A$ has all distinct eigenvalues, then they will be. This is because we have the relations $e^{At}v = e^{\lambda t} v$ and $A^k v = \lambda^k v$ for any $\lambda$-eigenvector $v$ of $A$, so we can write $n$ distinct exponential functions as a linear combination of the $c_k(\tau)$, and any relation among the $c_k$ would imply a relation among the $e^{\lambda t}$, which is impossible (it is a basic result from Fourier analysis that exponential functions are linearly independent).

However, this result actually needs $A$ to have distinct eigenvalues. In particular, if one takes $A = I$, the $n \times n$ identity matrix, then you can show that all but one of the $c_k$ can be chosen arbitrarily. This is because $I$, $I^2$, $\ldots$ are all equal to each other, and thus linearly dependent.

What we need to do instead is let $m$ be the degree of the minimal polynomial $p$ such that $p(A) = 0$. Then we can actually write $e^{At}$ as a polynomial of degree at most $m-1$ in $A$, with coefficients given by some functions $d_k(t)$:

$\sum_{k=0}^{m-1} d_k(t)A^k = e^{At}$

By the way in which the $d_k$ were constructed (by applying polynomial relations to an absolutely convergent Taylor series), we know that they are all infinitely differentiable, hence we can differentiate both sides $l$ times and write

$\sum_{k=0}^{m-1} d_k^{(l)}(t) A^k = A^l e^{At}$

Now look at these derivatives from $l = 0$ to $l = m-1$. If the $d_k(t)$ were linearly dependent, their derivatives would satisfy the same relation, and therefore (by evaluating everything at $t = 0$) the matrices $A^0, A^1, \ldots, A^{m-1}$ would satisfy a linear relation, which is impossible, since then $A$ would satisfy a polynomial relation of degree less than $m$.

So, the $d_k(t)$ are linearly independent, and thus by the argument with Gram-Schmidt above we can write $e^{At}x(0)$ plus anything in the column space of $[B \ AB \ \ldots \ A^{m-1}B]$ as

$e^{At}x(0) + \sum_{k=0}^{m-1} A^kB \int_{0}^{t} d_k(\tau)u(t-\tau) d\tau$

for any $t > 0$. So are we done? Almost. The last step we need to finish is to note that if $A$ satisfies a polynomial of degree $m$ then the column space of $[B \ AB \ \ldots \ A^{m-1}B]$ is the same as the column space of $[B \ AB \ \ldots \ A^{n-1}B]$, for $n > m$.

So, that proves the result (3) about the controllability matrix. It was a lot of work in the continuous time case, although it matches our intuition for why it should be true (taking an exponential and taking a derivative are somewhat complementary to each other, so it made sense to do so; and I think there are probably results in analysis that make this connection precise and explain why we should get the controllability result in the continuous case more or less for free).

As I said before, (4) will have to wait until later.

In addition to controllability, we have a notion of stabilizability, which means that we can influence all unstable modes of $A$. In other words, we can make sure that the system eventually converges to the origin (although not necessarily in finite time). Versions of criteria (2) and (4) exist for stabilizable systems. Criterion (2) becomes a requirement that no left eigenvector of $A$ whose eigenvalue has non-negative real part is in the left null space of $B$. Criterion (4) becomes a requirement that there exist $F$ such that $A+BF$ has only eigenvalues with negative real part.
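The modified criterion (2) can be tested with a rank computation, which is often called the PBH or Hautus test: a left eigenvector $w$ of $A$ with eigenvalue $\lambda$ lies in the left null space of $B$ exactly when $[A - \lambda I \ \ B]$ fails to have full row rank. Here is a minimal sketch, assuming NumPy; the function name, the tolerance, and the example matrices are made up for illustration.

```python
import numpy as np

def is_stabilizable(A, B, tol=1e-9):
    """Modified criterion (2): check only eigenvalues with Re >= 0."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        if lam.real >= -tol:
            # A left eigenvector w (w^T A = lam w^T) with w^T B = 0 exists
            # exactly when [A - lam I, B] fails to have full row rank.
            test = np.hstack([A - lam * np.eye(n), B])
            if np.linalg.matrix_rank(test) < n:
                return False
    return True

# Hypothetical examples: one unstable mode (+1) and one stable mode (-1).
A = np.diag([1.0, -1.0])
B_good = np.array([[1.0], [0.0]])  # input reaches the unstable mode
B_bad = np.array([[0.0], [1.0]])   # input only reaches the stable mode
```

Dropping the `lam.real >= -tol` filter turns this into a test of the original criterion (2), i.e. full controllability. Rank decisions on nearly repeated eigenvalues are numerically delicate, so treat the answer as a sanity check.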

Observers

We say that a system is observable if, for any initial state $x(0)$ and any control tape $u(t)$, it is possible in finite time to infer $x(0)$ given only $u(t)$ and the output $y(t)$. In particular, we are not given any information about the internal states $x(t)$ of the system (except through $y(t)$), although it is assumed that $A$, $B$, $C$, and $D$ are known. If we have a non-linear system

$\dot{x} = f(x,u)$

$y = g(x,u)$

then it is assumed that $f$ and $g$ are known.

It turns out that observability for a system is exactly the same as controllability for the dual system, so all the criteria from the previous section hold in a suitably dual form. One thing worth thinking about is why these results still hold for any control tape $u(t)$.

(1) There is no non-zero subspace $W$ of $V$ such that $A(W) \subset W$ and $C(W) = 0$. In other words, there is no space that doesn’t show up in the output and such that the natural dynamics of the system stay in that space.

(2) There is no right eigenvector of $A$ that is in the right null space of $C$.

(3) The matrix $\left[ \begin{array}{c} C \\ CA \\ CA^2 \\ \vdots \\ CA^{n-1} \end{array} \right]$ has full column rank.

(4) The eigenvalues of $A+LC$ can be assigned arbitrarily by an appropriate choice of $L$.

Just as the matrix $F$ from the previous section can be thought of as a linear feedback law that gives the system arbitrary eigenvalues, the matrix $L$ is part of a feedback law for something called a Luenberger observer.

Also, just as there is stabilizability for a system, meaning that we can control all of the unstable modes, there is also detectability, which means that we can detect all of the unstable modes.
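As a small illustration of criterion (3) and of the duality, the sketch below (assuming NumPy; the names and the double-integrator example are hypothetical) builds the observability matrix and then reuses it to test controllability of a pair $(A, B)$ by passing in $(A^T, B^T)$.

```python
import numpy as np

def observability_matrix(A, C):
    """Stack C, CA, ..., CA^(n-1) on top of each other."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

def is_observable(A, C):
    """Criterion (3): the observability matrix has full column rank."""
    return np.linalg.matrix_rank(observability_matrix(A, C)) == A.shape[0]

def is_controllable(A, B):
    """Controllability expressed through the dual system (A^T, B^T)."""
    return is_observable(A.T, B.T)

# Hypothetical example: double integrator with state (position, velocity)
A = np.array([[0.0, 1.0], [0.0, 0.0]])
C_position = np.array([[1.0, 0.0]])  # measure position only
C_velocity = np.array([[0.0, 1.0]])  # measure velocity only
```

Measuring position is enough here because velocity shows up in how the position changes; measuring only velocity never reveals the initial position, so that pair is not observable.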

Luenberger Observers

An observer is a process that estimates the state of an observable system given information about its outputs. If a system is detectable, and $L$ is such that $A+LC$ has only eigenvalues with negative real part, then consider the system

$\dot{q} = Aq+Bu+L(Cq+Du-y)$

Using the fact that $Du-y = -Cx$, we see that

$\dot{(q-x)} = (A+LC)(q-x)$, so that $q-x$ decays exponentially to zero (by the assumption on the eigenvalues of $A+LC$). Thus the dynamical system above, which is called a Luenberger observer, will asymptotically approach the true state of a system given arbitrary initial conditions.
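To see the error decay concretely, here is a rough simulation sketch, assuming NumPy and a simple forward-Euler integrator. The double-integrator plant, the gain $L$, the initial conditions and the sinusoidal input are all hypothetical choices; $L$ was picked by hand so that $A+LC$ has eigenvalues $-1$ and $-2$.

```python
import numpy as np

# Hypothetical system: double integrator, measuring position only (D = 0).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
L = np.array([[-3.0], [-2.0]])  # chosen so that A + LC has eigenvalues -1, -2

def simulate(x0, q0, u_of_t, dt=1e-3, t_end=10.0):
    """Forward-Euler simulation of the plant x and the observer q."""
    x, q = np.array(x0, dtype=float), np.array(q0, dtype=float)
    for k in range(int(round(t_end / dt))):
        u = np.array([u_of_t(k * dt)])
        y = C @ x                                # measured output
        x_dot = A @ x + B @ u
        q_dot = A @ q + B @ u + L @ (C @ q - y)  # observer dynamics
        x, q = x + dt * x_dot, q + dt * q_dot
    return x, q

x_final, q_final = simulate(x0=[1.0, -0.5], q0=[0.0, 0.0], u_of_t=np.sin)
error = np.linalg.norm(q_final - x_final)  # should be small after 10 seconds
```

Note that the observer never reads `x` directly; it only sees `y` and `u`. The plant state itself is not driven to zero here, only the estimation error.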

If a system is both controllable and observable, can we design an observer and a controller that, working together, successfully control the system? (This question is non-trivial because the controller has to use the estimated state from the observer, rather than the actual state of the system, for feedback.) The answer is no in general, but it is yes for linear systems.

Let $F$ be such that $A+BF$ is stable and let $L$ be such that $A+LC$ is stable. (A matrix is stable if all of its eigenvalues have negative real part.) Now we will consider the system obtained by using $L$ as a Luenberger observer and $F$ as a linear feedback law. Let $e := q-x$. Then we have

$\dot{e} = (A+LC)e$

$\dot{x} = Ax+BFq = (A+BF)x + BFe$

In matrix form, this gives

$\left[ \begin{array}{c} \dot{e} \\ \dot{x} \end{array} \right] = \left[ \begin{array}{cc} A+LC & 0 \\ BF & A+BF \end{array} \right] \left[ \begin{array}{c} e \\ x \end{array} \right].$

Because of the block triangular form of the matrix, we can see that its eigenvalues are given by the eigenvalues of $A+LC$ and $A+BF$. Since $A+LC$ and $A+BF$ are both stable, so is the matrix given above, so we can successfully stabilize the above system to the origin. Of course, this is weaker than full controllability. However, if we have full controllability and observability, then we can set the eigenvalues of the above matrix arbitrarily, which should imply full controllability (I haven’t sat down and proved this rigorously, though).

So, now we know how to stabilize a linear system if it is detectable and stabilizable. The main thing to take away from this is the fact that the poles of the coupled dynamics of state and observation error are exactly the eigenvalues of $A+BF$ and $A+LC$ considered individually.
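The eigenvalue claim is easy to confirm numerically on a small example. In the sketch below (assuming NumPy; the plant and the gains $F$ and $L$ are hypothetical and were chosen by hand), the eigenvalues of the coupled $(e, x)$ matrix are compared against those of $A+LC$ and $A+BF$ computed separately.

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
F = np.array([[-12.0, -7.0]])   # A + BF has eigenvalues -3, -4
L = np.array([[-3.0], [-2.0]])  # A + LC has eigenvalues -1, -2

# Coupled dynamics of (e, x) in the block form given above
closed_loop = np.block([
    [A + L @ C, np.zeros((2, 2))],
    [B @ F,     A + B @ F],
])

coupled = np.sort_complex(np.linalg.eigvals(closed_loop))
separate = np.sort_complex(np.concatenate([
    np.linalg.eigvals(A + L @ C),
    np.linalg.eigvals(A + B @ F),
]))
# coupled and separate should agree up to rounding
```

This is only a spot check on one example; the general statement follows from the block triangular structure.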

State-space representations

The final topic I’d like to talk about in this post is state-space representations of transfer functions. It is here that I will prove all of the results that I promised to take care of later. There are plenty more topics in linear control theory, but I’ve been writing this post for a few days now and it’s at a good stopping point, so I’ll leave the rest of the topics for a later post.

A state-space representation of a transfer function is exactly what it sounds like. Given a transfer function $P(s)$ from $U$ to $Y$, find a state-space model

$\dot{x} = f(x,u)$

$y = g(x,u)$

that has $P$ as a transfer function. We’ll be concerned with linear state-space representations only.

The first thing to note is that a linear state-space representation of $P(s)$ can always be reduced to a smaller representation unless the representation is both controllable and observable (by just restricting to the controllable and observable subspace).

The next thing to note is that, since the transfer function of a state-space representation is $C(sI-A)^{-1}B+D$, a transfer function $P(s)$ has an irreducible (in the sense of the preceding paragraph) linear state-space representation of degree $n$ if and only if $P(s) = \frac{q(s)}{r(s)}$, where $q$ and $r$ are polynomials with $\deg(q) \leq \deg(r) = n$. Thus all controllable and observable linear state-space representations of $P(s)$ have the same dimension, and therefore there exists some non-canonical vector space isomorphism such that we can think of any two such representations as living in the same state space (though possibly with different matrices $A$, $B$, $C$, and $D$).

Finally, if two state-space representations over the same vector space have the same transfer function, then one can be obtained from the other by a change of coordinates. I will now make this more precise and also prove it.

Claim: Suppose that $R_1$ and $R_2$ are two (not necessarily linear) state-space representations with the same input-output mapping. If $R_1$ is controllable and $R_2$ is observable, then there is a canonical map from the state space of $R_1$ to the state space of $R_2$. If $R_1$ is observable, then this map is injective. If $R_2$ is controllable, then this map is surjective. If $R_1$ and $R_2$ are both linear representations, then the map is linear.

Proof: Let the two representations be $\dot{x_1} = f_1(x_1,u), y_1 = g_1(x_1,u)$ and $\dot{x_2} = f_2(x_2,u), y_2 = g_2(x_2,u)$.

Since $R_1$ is controllable, we can take an input tape that sends $x_1$ to an arbitrary state $x$ at some time $t_0$. Then by looking at $y_2$ evolve under the same input tape, by the observability of $R_2$ we will eventually be able to determine $x_2(t_0)$ uniquely. The canonical map sends the $x$ we chose to $x_2(t_0)$. The fact that $y_1(t) = y_2(t)$ for all $t$ guarantees that $x_2(t_0)$ is well-defined (i.e., it doesn’t matter what $u$ we choose to get there).

If $R_2$ is controllable, then we can choose a $u$ that causes us to end up with whatever $x_2(t_0)$ we choose, which implies that the map is surjective. Now for the purposes of actually computing the map, we can always assume that the control input becomes $0$ once we get to the desired $x_1(t_0)$. Then there is a one-to-one correspondence between possible output tapes after time $t_0$ and possible values of $x_2(t_0)$. If $R_1$ is observable, this is also true for $x_1(t_0)$, which implies injectivity. I will leave it to you to verify that the map is linear if both representations are linear.

Finally, I would like to introduce a special case of controllable canonical form and use it to prove criterion (4) about controllability. It will also show, at least in a special case, that any transfer function that is a quotient of two polynomials (where the denominator has at least as high degree as the numerator) has a linear state-space representation.

The special case is when $U$ is one-dimensional. Then our transfer matrix can be written in the form

$P(s) = \frac{\vec{c_1}s^{n-1}+\vec{c_2}s^{n-2}+\ldots+\vec{c_n}}{s^n+a_1s^{n-1}+\ldots+a_n}+\vec{d}$

It turns out that this transfer function can be represented by the following state-space model:

$A = \left[ \begin{array}{ccccc} -a_1 & -a_2 & \ldots & -a_{n-1} & -a_n \\ 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ldots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{array} \right], B = \left[ \begin{array}{c} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{array} \right]$

$C = \left[ \begin{array}{ccccc} \vec{c_1} & \vec{c_2} & \cdots & \vec{c_{n-1}} & \vec{c_n} \end{array} \right], D = \vec{d}$

This might seem a bit contrived, but the construction for $A$ is a nice trick for constructing a matrix with a given characteristic polynomial. Also note that $A$ will have a single Jordan block for each distinct eigenvalue (whose size is the number of times that eigenvalue appears in the list $\lambda_1, \ldots, \lambda_n$). One can show directly that this is a necessary and sufficient condition for being controllable by a single input.
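A numerical spot check (not a proof) of this construction is sketched below, assuming NumPy and a scalar output so that the $\vec{c_i}$ and $\vec{d}$ are just numbers. The helper names and the coefficient values are hypothetical. It evaluates $C(sI-A)^{-1}B+D$ at a test point and compares it with the rational function.

```python
import numpy as np

def canonical_form(a, c, d):
    """Build (A, B, C, D) from denominator coefficients a_1..a_n,
    numerator coefficients c_1..c_n and the constant term d."""
    n = len(a)
    A = np.zeros((n, n))
    A[0, :] = -np.asarray(a, dtype=float)   # first row: -a_1 ... -a_n
    A[1:, :-1] = np.eye(n - 1)              # ones on the subdiagonal
    B = np.zeros((n, 1))
    B[0, 0] = 1.0
    C = np.asarray(c, dtype=float).reshape(1, n)
    D = np.array([[float(d)]])
    return A, B, C, D

def transfer_from_state_space(A, B, C, D, s):
    """Evaluate C (sI - A)^(-1) B + D at a single complex frequency s."""
    n = A.shape[0]
    return (C @ np.linalg.solve(s * np.eye(n) - A, B) + D)[0, 0]

def transfer_from_polynomials(a, c, d, s):
    """Evaluate the rational function directly."""
    numerator = np.polyval(c, s)              # c_1 s^(n-1) + ... + c_n
    denominator = np.polyval([1.0, *a], s)    # s^n + a_1 s^(n-1) + ... + a_n
    return numerator / denominator + d

# Hypothetical coefficients; the denominator is (s+1)(s+2)(s+3)
a, c, d = [6.0, 11.0, 6.0], [1.0, 0.0, 2.0], 0.5
s = 2.0 + 1.0j
lhs = transfer_from_state_space(*canonical_form(a, c, d), s)
rhs = transfer_from_polynomials(a, c, d, s)
```

The reason it works is visible in the solve: the rows below the first force the solution of $(sI-A)x = B$ to have entries proportional to $s^{n-1}, \ldots, s, 1$, and the first row fixes the scale to be one over the denominator.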

I will leave it to you to check the details that the above state-space model actually has $P(s)$ as a transfer function. (Bonus question: what is the equivalent observable canonical form for observable single-output systems?) I will wrap up this post by proving criterion (4) about controllability, as promised. I have reproduced it below for convenience:

(4) An LTI system is controllable if and only if we can assign the eigenvalues of $A+BF$ arbitrarily by a suitable choice of $F$.

Proof: I will prove the “only if” direction, since that is the difficult direction. First consider the case when we have a single-input system. Then take the transfer function from $u$ to $x$ (this is the same as assuming that $C = I$, $D = 0$). By the result above and the assumption of controllability, there exists a system with the same transfer function in controllable canonical form, and thus there is a change of coordinates that puts our system in controllable canonical form. Once we are in canonical form, it is easy to see that by choosing $F = \left[ \begin{array}{ccccc} -b_1 & -b_2 & \ldots & -b_{n-1} & -b_n \end{array} \right]$, we end up with a system whose characteristic polynomial is $\lambda^n + (a_1+b_1)\lambda^{n-1} + \ldots + (a_{n-1}+b_{n-1})\lambda + (a_n+b_n)$. We can therefore give $A+BF$ an arbitrary characteristic polynomial, and thus choose its eigenvalues arbitrarily.

This proves the desired result in the case when we have a single input to our system. When we have multiple inputs, we have to consider them one-by-one, and use the fact that linear feedback can’t affect the eigenvalues of the parts of the system that are outside the controllable subspace. I haven’t checked this approach very carefully, so it might not work, but I am pretty sure it can be made to work. If you want more details, feel free to ask me and I will provide them. At this point, though, I’m writing more of a treatise than a blog post, so I really think I should cut myself off here. I hope the exposition hasn’t suffered at all from this, but if it has, feel free to call me on it and I will clarify myself.
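Going back to the single-input case, the canonical-form step of the proof can be sketched in a few lines, assuming NumPy and a system that is already in controllable canonical form (the change of coordinates is not computed here). The helper names and the triple-integrator example are hypothetical.

```python
import numpy as np

def companion(a):
    """Controllable canonical A for s^n + a_1 s^(n-1) + ... + a_n."""
    n = len(a)
    A = np.zeros((n, n))
    A[0, :] = -np.asarray(a, dtype=float)
    A[1:, :-1] = np.eye(n - 1)
    return A

def place_canonical(a, desired_eigenvalues):
    """Feedback row F = [-b_1 ... -b_n] giving the desired eigenvalues.

    Complex eigenvalues must come in conjugate pairs so that F is real.
    """
    # np.poly returns [1, a_1 + b_1, ..., a_n + b_n] for the desired roots
    target = np.real(np.poly(desired_eigenvalues))[1:]
    b = target - np.asarray(a, dtype=float)
    return -b.reshape(1, -1)

a = [0.0, 0.0, 0.0]                      # open loop: triple integrator
A = companion(a)
B = np.array([[1.0], [0.0], [0.0]])
F = place_canonical(a, [-1.0, -2.0, -3.0])
closed_loop_eigs = np.sort(np.linalg.eigvals(A + B @ F).real)
```

For a system that is not already in canonical form, one would first need the change of coordinates from the proof; this sketch skips that step entirely.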

My next post will take a break from linear control and tell you why using least squares is one of the worst ideas ever (because you think it will work when it actually won’t; if you don’t believe me I’ll show you how negligible sampling errors can easily cause you to be off by 10 percent in your model parameters).

The Underwater Cartpole

2010-06-26T00:00:00-07:00

My last few posts have been rather abstract. I thought I’d use this one to go into some details about the actual system we’re working with.

As I mentioned before, we are looking at a cart pole in a water tunnel. A cart pole is sometimes also called an inverted pendulum. Here is a diagram from Wikipedia:

The parameter we have control over is F, the force on the cart. We would like to use this to control both the position of the cart and the angle of the pendulum. If the cart is standing still, the only two possible fixed points of the system are $\theta = 0$ (the bottom, or “downright”) and $\theta = \pi$ (the “upright”). Since $\theta = 0$ is easy to get to, we will be primarily interested in getting to $\theta = \pi$.

For now, I’m just going to worry about the regular cart pole system, without introducing any fluid dynamics. This is because the fluid dynamics are complicated, even with a fairly rough model (called the Quasi-steady Model), and I don’t know how to derive them anyway. Before continuing, it would be nice to have an explicit parametrization of the system. There are two position states we care about: $x$, the cart position; and $\theta$, the pendulum angle, which we will set to $0$ at the bottom with the counter-clockwise direction being positive. I realize that this is not what the picture indicates, and I apologize for any confusion. I couldn’t find any good pictures that parametrized it the way I wanted, and I’m going to screw up if I use a different parametrization than what I’ve written down.

At any rate, in addition to the two position states $x$ and $\theta$, we also care about the velocity states $\dot{x}$ and $\dot{\theta}$, so that we have four states total. For convenience, we’ll also name a variable $u := \frac{F}{M}$, so that we have a control input $u$ that directly affects the acceleration of the cart. We also have system parameters $M$ (the mass of the cart), $m$ (the mass of the pendulum), $g$ (the acceleration due to gravity), $l$ (the length of the pendulum arm), and $I$ (the inertia of the pendulum arm). With these variables, we have the following equations of motion:

$\left[ \begin{array}{c} \dot{x} \\ \dot{\theta} \\ \ddot{x} \\ \ddot{\theta} \end{array} \right] = \left[ \begin{array}{c} \dot{x} \\ \dot{\theta} \\ 0 \\ -\frac{mgl\sin(\theta)}{I} \end{array} \right] + \left[ \begin{array}{c} 0 \\ 0 \\ 1 \\ -\frac{ml\cos(\theta)}{I} \end{array} \right] u$

You will note that the form of these equations is different from the form in my last post. This is because I misspoke last time. The actual form we should use for a general system is

$\dot{x} = f(x) + B(x)u,$

or, if we are assuming a second-order system, then

$\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[ \begin{array}{c} \dot{q} \\ f(q,\dot{q}) \end{array} \right] + B(q,\dot{q}) u.$

Here we are assuming that the natural system dynamics can be arbitrarily non-linear in $x$, but the effect of control is still linear for any fixed system state (which, as I noted last time, is a pretty safe assumption). The time when we use the form $\dot{x} = Ax + Bu$ is when we are talking about a linear system — usually a linear time-invariant system, but we can also let $A$ and $B$ depend on time and get a linear time-varying system.
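To make the control-affine form concrete, here is a minimal sketch of the cart pole written as $f(x) + B(x)u$ in plain Python. The parameter values are hypothetical, and the control column uses the coefficient $-\frac{ml\cos(\theta)}{I}$, which is what a torque balance about the pivot gives when $u$ is the cart acceleration; treat that coefficient as an assumption of the sketch.

```python
import math

# Hypothetical parameter values, for illustration only
m, l, g, I = 0.2, 0.5, 9.81, 0.05   # pendulum mass, arm length, gravity, inertia

def f(state):
    """Passive dynamics: state = (x, theta, x_dot, theta_dot)."""
    x, theta, x_dot, theta_dot = state
    return [x_dot, theta_dot, 0.0, -m * g * l * math.sin(theta) / I]

def B(state):
    """Control column; u is the commanded cart acceleration."""
    _, theta, _, _ = state
    # Assumption: torque per unit cart acceleration is m*l*cos(theta)
    return [0.0, 0.0, 1.0, -m * l * math.cos(theta) / I]

def state_derivative(state, u):
    """x_dot = f(x) + B(x) u, affine in the control input u."""
    return [fi + bi * u for fi, bi in zip(f(state), B(state))]

def euler_step(state, u, dt):
    """One forward-Euler integration step of length dt."""
    return [s + dt * ds for s, ds in zip(state, state_derivative(state, u))]
```

For any fixed state, `state_derivative` is a linear function of `u` plus a constant, which is exactly the structure described above, even though the dependence on `theta` is non-linear.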

I won’t go into the derivation of the equations of motion of the above system, as it is a pretty basic mechanics problem and you can find the derivation on Wikipedia if you need it. Instead, I’m going to talk about some of the differences between this system and the underwater system, why this model is still important, and how we can apply the techniques from the last two posts to get a good controller for this system.

Differences from the Underwater System

In the underwater system, instead of having gravity, we have a current (the entire system is on the plane perpendicular to gravity). I believe that the effect of current is much the same as the effect of gravity (although with a different constant), but that could actually be wrong. At any rate, the current plays the role that gravity used to play in terms of defining “up” and “down” for the system (as well as creating a stable fixed point at $\theta = 0$ and an unstable fixed point at $\theta = \pi$).

More importantly, there is significant drag on the pendulum, and the drag is non-linear. (There is always some amount of drag on a pendulum due to friction of the joint, but it’s usually fairly linear, or at least easily modelled.) The drag becomes the greatest when $\theta = \pm \frac{\pi}{2}$, which is also the point at which $u$ becomes useless for controlling $\theta$ (note the $\cos(\theta)$ term in the effect of $u$ on $\ddot{\theta}$). This means that getting past $\frac{\pi}{2}$ is fairly difficult for the underwater system.

Another difference is that high accelerations will cause turbulence in the water, and I’m not sure what effect that will have. The model we’re currently using doesn’t account for this, and I haven’t had a chance to experiment with the general fluid model (using PDEs) yet.

Why We Care

So with all these differences, why am I bothering to give you the equations for the regular (not underwater) system? More importantly, why would I care about them for analyzing the actual system in question?

I have to admit that one of my reasons is purely pedagogical. I wanted to give you a concrete example of a system, but I didn’t want to just pull out a long string of equations from nowhere, so I chose a system that is complex enough to be interesting but that still has dynamics that are simple to derive. However, there are also better reasons for caring about this system. The qualitative behaviour of this system can still be good for giving intuition about the behaviour of the underwater system.

For instance, one thing we want to be able to do is swing-up. With limited magnitudes of acceleration and a limited space (in terms of $x$) to perform maneuvers in, it won’t be possible in general to perform a swing-up. However, there are various system parameters that could make it easier or harder to perform the swing-up. For instance, will increasing $I$ (the inertia of the pendulum) make it easier or harder to perform a swing-up? (You should think about this if you don’t know the answer, so I’ve provided it below the fold.)

The answer is that higher inertia makes it easier to perform a swing-up (this is more obvious if you think about the limiting cases of $I \to 0$ and $I \to \infty$). The reason is that a higher moment of inertia makes it possible to store more energy in the system at the same velocity. Since the drag terms are going to depend on velocity and not energy, having a higher inertia means that we have more of a chance of building up enough energy to overcome the energy loss due to drag and get all the way to the top.

In general, various aspects of the regular system will still be true in a fluid on the proper time scales. I think one thing that will be helpful to do when we start dealing with the fluid mechanics is to figure out exactly which things are true on which time scales.

What we’re currently using this system for is the base dynamics of a high-gain observer, which I’ll talk about in a post or two.

I apologize for being vague on these last two justifications. The truth is that I don’t fully understand them myself. The first one will probably have to wait until I start toying with the full underwater system; the second (high-gain observers) I hope to figure out this weekend after I check out Khalil’s book on control from Barker Library.

Hopefully, though, I’ve at least managed somewhat to convince you that the dynamics of this simpler system can be informative for the more complicated system.

Controlling the Underwater Cartpole

Now we finally get to how to control the underwater cartpole. Our desired control task is to get to the point $\left[ \begin{array}{cccc} 0 & \pi & 0 & 0 \end{array} \right]$. That is, we want to get to the unstable fixed point at $\theta = \pi$. In the language of my last post, if we wanted to come up with a good objective function $J$, we could say that $J$ is equal to the closest we ever get to $\theta = \pi$ (assuming we never pass it), and if we do get to $\theta = \pi$ then it is equal to the smallest velocities we ever get as we pass $\theta = \pi$; also, $J$ is equal to infinity if $x$ ever gets too large (because we run into a wall), or if $u$ gets too large (because we can only apply a finite amount of acceleration).

You will notice that I am being pretty vague about how exactly to define $J$ (my definition above wouldn’t really do, as it would favor policies that just barely fail to get to $\theta = \pi$ over policies that go past it too quickly, which we will see is suboptimal). There are two reasons for my vagueness. First, there are really two different parts to the control action: swing-up and balancing. Each of these parts should really have its own cost function, as once you can do both individually it is pretty easy to combine them. Second, I’m not really going to care all that much about the cost function for what I say below. I did have occasion to use a more well-defined cost function for the swing-up when I was doing learning-based control, but this didn’t make its way (other than by providing motivation) into the final controller.

I should point out that the actual physical device we have is more velocity-limited than acceleration-limited. It can apply pretty impressive accelerations, but it can also potentially damage itself at high velocities (by running into a wall too quickly). We can in theory push it to pretty high velocities as well, but I’m a little bit hesitant to do so unless it becomes clearly necessary, as breaking the device would suck (it takes a few weeks to get it repaired). As it stands, I haven’t (purposely) run it at higher velocities than 1.5 meters/sec, which is already reasonably fast if you consider that the range of linear motion is only 23.4 cm.

But now I’m getting sidetracked. Let’s get back to swing-up and balancing. As I said, we can really divide the overall control problem into two separate problems of swing-up and balancing. For swing-up, we just want to get enough energy into the system for it to get up to $\theta = \pi$. We don’t care if it’s going too fast at $\theta = \pi$ to actually balance. This is because it is usually harder to add energy to a system than to remove energy, so if we’re in a situation where we have more energy than necessary to get to the top, we can always just perform the same control policy less efficiently to get the right amount of energy.

For balancing, we assume that we are fairly close to the desired destination point, and we just want to get the rest of the way there. As I mentioned last time, balancing is generally the easier of the two problems because of LQR control.

In actuality, these problems cannot be completely separated, due to the finite amount of space we have to move the cart in. If the swing up takes us to the very edge of the available space, then the balancing controller might not have room to actually balance the pendulum.

Swing-up

I will first go into detail on the problem of swing-up. The way I think about this is that the pendulum has some amount of energy, and that energy gets sapped away due to drag. In the underwater case, the drag is significant enough that we really just want to add as much energy as possible. How can we do this? You will recall from classical mechanics that the faster an object is moving, the faster you can add energy to that object. Also, the equations of motion show us that an acceleration in $x$ has the greatest effect on $\dot{\theta}$ when $|\cos(\theta)|$ is largest, that is, when $\theta = 0$ or $\theta = \pi$. At the same time, we expect the pendulum to be moving fastest when $\theta = 0$, since at that point it has the smallest potential energy, and therefore (ignoring energy loss due to drag), the highest kinetic energy. So applying force will always be most useful when $\theta = 0$.

Now there is a slight problem with this argument. The problem is that, as I keep mentioning, the cart only has a finite distance in which to move. If we accelerate the cart in one direction, it will keep moving until we again accelerate it in the opposite direction. So even though we could potentially apply a large force at $\theta = 0$, we will have to apply a similarly large force later, in the opposite direction. I claim, however, that the following policy is still optimal: apply a large force at $\theta = 0$, sustain that force until it becomes necessary to decelerate (to avoid running into a wall), then apply a large decelerating force. I can’t prove rigorously that this is the optimal strategy, but the reasoning is that this adds energy when $\theta$ is changing the fastest, so by the time we have to decelerate and remove energy $\cos(\theta)$ will be significantly smaller, and therefore our deceleration will have less effect on the total energy.

To do the swing-up, then, we just keep repeating this policy whenever we go past $\theta = 0$ (assuming that we can accelerate in the appropriate direction to add energy to the system). The final optimization is that, once we get past $|\theta| = \frac{\pi}{2}$, the relationship between $\ddot{x}$ and $\ddot{\theta}$ flips sign, and so we would like to apply the same policy of rapid acceleration and deceleration in this regime as well. This time, however, we don’t wait until we get to $\theta = \pi$, as at that point we’d be done. Instead, we should perform the energy pumping at $\dot{\theta} = 0$, which will cause $\dot{\theta}$ to increase above $0$ again, and then go in the opposite direction to pump more energy when $\dot{\theta}$ becomes $0$ for the second time.

I hope that wasn’t too confusing of an explanation. When I get back to lab on Monday, I’ll put up a video of a MATLAB simulation of this policy, so that it’s more clear what I mean. At any rate, that’s the idea behind swing-up: use up all of your space in the $x$-direction to pump energy into the system at maximum acceleration, doing so at $\theta = 0$ and when $\dot{\theta} = 0$ and we are past $|\theta| = \frac{\pi}{2}$. Now, on to balancing.
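The energy bookkeeping behind this can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the policy described above: it ignores drag and the finite track, uses hypothetical parameter values, and assumes the control enters $\ddot{\theta}$ through $-\frac{ml\cos(\theta)}{I}u$. Under that assumption the energy changes at rate $-ml\cos(\theta)\dot{\theta}u$, so the sketch simply picks the sign of $u$ that makes this positive at every step.

```python
import math

m, l, g, I = 0.2, 0.5, 9.81, 0.05   # hypothetical parameters
u_max = 2.0                          # hypothetical acceleration limit

def energy(theta, theta_dot):
    """Pendulum energy, zero when hanging at rest at theta = 0."""
    return 0.5 * I * theta_dot**2 + m * g * l * (1.0 - math.cos(theta))

def pump(theta, theta_dot):
    """Bang-bang acceleration whose sign always adds energy."""
    # dE/dt = -m*l*cos(theta)*theta_dot*u, so pick u with the opposite sign
    s = math.cos(theta) * theta_dot
    return -u_max * math.copysign(1.0, s) if s != 0.0 else u_max

def swing(theta, theta_dot, dt=1e-3, steps=5000):
    """Forward-Euler rollout; returns the energy at every step."""
    history = [energy(theta, theta_dot)]
    for _ in range(steps):
        u = pump(theta, theta_dot)
        theta_ddot = (-m * g * l * math.sin(theta)
                      - m * l * math.cos(theta) * u) / I
        theta, theta_dot = theta + dt * theta_dot, theta_dot + dt * theta_ddot
        history.append(energy(theta, theta_dot))
    return history

history = swing(theta=0.1, theta_dot=0.0)
```

The sign of the energy rate depends on $\cos(\theta)\dot{\theta}$, which is where both the sign flip past $|\theta| = \frac{\pi}{2}$ and the benefit of pushing when the pendulum is moving fast come from. A realistic version would also have to track the cart position and reserve room to decelerate.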

Balancing

As I mentioned, if we have a good linear model of our system, we can perform LQR control. So the only real problem here is to get a good linear model. To answer Arvind’s question from last time, if we want good performance out of our LQR controller, we should also worry about the cost matrices $Q$ and $R$; for this system, the amount of space we have to balance (23.4cm, down to 18cm after adding in safeties to avoid hitting the wall) is small enough that it’s actually necessary to worry about $Q$ and $R$ a bit, which I’ll get to later.

First, I want to talk about how to get a good linear model. To balance, we really want a good linearization about $\theta = \pi$. Unfortunately, this is an unstable fixed point so it’s hard to collect data around it. It’s easier to instead get a good linearization about $\theta = 0$ and then flip the signs of the appropriate variables to get a linear model about $\theta = \pi$. My approach to getting this model was to first figure out what it would look like, then collect data, and finally do a least squares fit on that data.

Since we can’t collect data continuously, we need a discrete time linear model. This will look like

$x_{n+1} = Ax_n + Bu_n$

In our specific case, $A$ and $B$ will look like this:

$\left[ \begin{array}{c} \theta_{n+1} \\ y_{n+1} \\ \dot{\theta}_{n+1} \\ \dot{y}_{n+1} \end{array} \right] = \left[ \begin{array}{cccc} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ c_1 & 0 & c_2 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} \theta_n \\ y_n \\ \dot{\theta}_n \\ \dot{y}_n \end{array} \right] + \left[ \begin{array}{c} 0 \\ 0 \\ c_3 \\ dt \end{array} \right] u_n$

I got this form by noting that we definitely know how $\theta$, $y$, and $\dot{y}$ evolve with time, and the only question is what happens with $\dot{\theta}$. Moreover, $\dot{\theta}$ clearly cannot depend on $y$ or $\dot{y}$ (since we can set them arbitrarily by choosing a different inertial reference frame). This leaves only three parameters to determine: $c_1$, $c_2$, and $c_3$.

Once we have this form, we need to collect good data. The important thing to make sure of is that the structure of the data doesn’t show up in the model, since we care about the system, not the data. This means that we don’t want to input something like a sine or cosine wave, because that will only excite a single frequency of the system, and a linear system that is given something with a fixed frequency will output the same frequency. We should also avoid any sort of oscillation about $x = 0$, or else our model might end up thinking that it’s supposed to oscillate about $x = 0$ in general. I am sure there are other potential issues, and I don’t really know much about good experimental design, so I can’t talk much about this, but the two issues above are ones that I happened to run into personally.

What I ended up doing was taking two different functions of $x$ that had a linearly increasing frequency, then differentiating twice to get acceleration profiles to feed into the system. I used these two data sets to do a least squares fit on $c_1$, $c_2$, and $c_3$, and then I had my model. I transformed my discrete-time model into a continuous-time model (MATLAB has a function called d2c that can do this), inverted the appropriate variables, and got a model about the upright ($\theta = \pi$).
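The fitting step can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the data below are synthetic, generated from made-up "true" values of $c_1$, $c_2$, and $c_3$ rather than measured on the real cart-pole, and the chirp frequencies, amplitude, and time step are hypothetical. It shows how the third row of the model, $\dot{\theta}_{n+1} = c_1\theta_n + c_2\dot{\theta}_n + c_3 u_n$, becomes an ordinary least-squares problem.

```python
import numpy as np

dt = 0.01                          # sample period in seconds (hypothetical)
t = np.arange(0.0, 20.0, dt)

# Position profile whose frequency rises linearly from f0 to f1 (a chirp).
f0, f1 = 0.1, 1.0                  # Hz, hypothetical
phase = 2.0 * np.pi * (f0 * t + 0.5 * (f1 - f0) / t[-1] * t**2)
x = 0.05 * np.sin(phase)

# Differentiate twice (numerically) to get the acceleration command u_n.
u = np.gradient(np.gradient(x, dt), dt)

# Made-up "true" parameters, used only to synthesise data for this sketch.
c_true = np.array([-0.05, 0.98, -0.02])

theta = np.zeros_like(t)
theta_dot = np.zeros_like(t)
for n in range(len(t) - 1):
    theta[n + 1] = theta[n] + dt * theta_dot[n]
    theta_dot[n + 1] = (c_true[0] * theta[n]
                        + c_true[1] * theta_dot[n]
                        + c_true[2] * u[n])

# Least squares: each row is [theta_n, theta_dot_n, u_n], target is theta_dot_{n+1}.
Phi = np.column_stack([theta[:-1], theta_dot[:-1], u[:-1]])
target = theta_dot[1:]
c_hat, *_ = np.linalg.lstsq(Phi, target, rcond=None)

print("estimated c1, c2, c3:", c_hat)
```

With real data the measurements would be noisy, so the fit would not recover the parameters exactly, and several runs could be stacked into the same regression by concatenating the rows of `Phi` and `target`.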

Now the only problem was how to choose $Q$ and $R$. The answer was this: I made $R$ fairly small ($0.1$), since we had a very strong actuator so large accelerations were fine. Then, I made the penalties on position larger than the penalties on velocity (since position is really what we care about). Finally, I thought about the amount that I would want the cart to slide to compensate for a given disturbance in $\theta$, and used this to choose a ratio between costs on $\theta$ and costs on $x$. In the end, this gave me $Q = \left[ \begin{array}{cccc} 40 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right]$.

I wanted to end with a video of the balancing controller in action, but unfortunately I can’t get my Android phone to upload video over the wireless, so that will have to wait.

Linear Control Theory: Part 0

2010-06-20T00:00:00-07:00

The purpose of this post is to introduce you to some of the basics of control theory and to introduce the Linear-Quadratic Regulator, an extremely good hammer for solving stabilization problems.

To start with, what do we mean by a control problem? We mean that we have some system with dynamics described by an equation of the form

$\dot{x} = Ax,$

where $x$ is the state of the system and $A$ is some matrix (which itself is allowed to depend on $x$). For example, we could have an object that is constrained to move in a line along a frictionless surface. In this case, the system dynamics would be

$\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[ \begin{array}{cc} 0 & 1 \\ 0 & 0 \end{array} \right]\left[ \begin{array}{c} q \\ \dot{q} \end{array} \right]. $

Here $q$ represents the position of the object, and $\dot{q}$ represents the velocity (which is a relevant component of the state, since we need it to fully determine the future behaviour of the system). If there were drag, then we could instead have the following equation of motion:

$\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[ \begin{array}{cc} 0 & 1 \\ 0 & -b \end{array} \right]\left[ \begin{array}{c} q \\ \dot{q} \end{array} \right], $

where $b$ is the coefficient of drag.

If you think a bit about the form of these equations, you will realize that it is both redundant and not fully general. The form is redundant because $A$ can be an arbitrary function of $x$, yet it also acts on $x$ as an argument, so the equation $\ddot{q} = q\dot{q}$, for example, could be written as

$\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[ \begin{array}{cc} 0 & 1 \\ \alpha \dot{q} & (1-\alpha) q \end{array} \right] \left[ \begin{array}{c} q \\ \dot{q} \end{array} \right]$

for any choice of $\alpha$. On the other hand, this form is also not fully general, since $x = 0$ will always be a fixed point of the system. (We could in principle fix this by making $\dot{x}$ affine, rather than linear, in $x$, but for now we’ll use the form given here.)

So, if this representation doesn’t uniquely describe most systems, and can’t describe other systems, why do we use it? The answer is that, for most systems arising in classical mechanics, the equations naturally take on this form (I think there is a deeper reason for this coming from Lagrangian mechanics, but I don’t yet understand it).

Another thing you might notice is that in both of the examples above, $x$ was of the form $\left[ \begin{array}{c} q \\ \dot{q} \end{array} \right]$. This is another common phenomenon (although $q$ and $\dot{q}$ may be vectors instead of scalars in general), owing to the fact that Newtonian mechanics produces second-order systems, and so we care about both the position and velocity of the system.

So, now we have a mathematical formulation, as well as some notation, for what we mean by the equations of motion of a system. We still haven’t gotten to what we mean by control. What we mean is that we assume that, in addition to the system state $x$, we have a control input $u$ (usually we can choose $u$ independently from $x$), such that the actual equations of motion satisfy

$\dot{x} = Ax+Bu,$

where again, $A$ and $B$ can both depend on $x$. What this really means physically is that, for any configuration of the system, we can choose a control input $u$, and $u$ will affect the instantaneous change in state in a linear manner. We normally call each of the entries of $u$ a torque.

The assumption of linearity might seem strong, but it is again true for most systems, in the sense that a linear increase in a given torque will induce a linear response in the kinematics of the system. But note that this is only true once we talk about mechanical torques. If we think of a control input as an electrical signal, then the system will usually respond non-linearly with respect to the signal. This is simply because the actuator itself provides a force that is non-linear with its electrical input.

We can deal with this either by saying that we only care about a local model, and the actuator response is locally linear to its input; or, we can say that the problem of controlling the actuator itself is a disjoint problem that we will let someone worry about. In either case, I will shamelessly use the assumption that the system response is linear in the control input.

So, now we have a general form for equations of motion with a control input. The general goal of a control problem is to pick a function $f(x,t)$ such that if we let $u = f(x,t)$ then the trajectory $X(t)$ induced by the equation $\dot{x} = Ax+Bf(x,t)$ minimizes some objective function $J(X,f)$. Sometimes our goals are more modest and we really just want to get to some final state, in which case we can make $J$ just be a function of the final state that assigns a score based on how close we end up to the target state. We might also have hard constraints on $u$ (because our actuators can only produce a finite amount of torque), in which case we can make $J$ assign an infinite penalty to any $f$ that violates these constraints.

As an example, let’s return to our first example of an object moving in a straight line. This time we will say that $\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[ \begin{array}{cc} 0 & 1 \\ 0 & 0 \end{array} \right] \left[ \begin{array}{c} q \\ \dot{q} \end{array} \right]+\left[ \begin{array}{c} 0 \\ 1 \end{array} \right]u$, with the constraint that $|u| \leq A$. We want to get to $x = \left[ \begin{array}{c} 0 \\ 0 \end{array} \right]$ as quickly as possible, meaning we want to get to $q = 0$ and then stay there. We could have $J(X,f)$ just be the amount of time it takes to get to the desired endpoint, with a cost of infinity on any $f$ that violates the torque limits. However, this is a bad idea, for two reasons.

The first reason is that, numerically, you will never really end up at exactly $\left[ \begin{array}{c} 0 \\ 0 \end{array} \right]$, just very close to it. So if we try to use this function on a computer, unless we are particularly clever we will assign a cost of $\infty$ to every single control policy.

However, we could instead have $J(X,f)$ be the amount of time it takes to get close to the desired endpoint. I personally still think this is a bad idea, and this brings me to my second reason. Once you come up with an objective function, you need to somehow come up with a controller (that is, a choice of $f$) that minimizes that objective function, or at the very least performs reasonably well as measured by the objective function. You could do this by being clever and constructing such a controller by hand, but in many cases you would much rather have a computer find the optimal controller. If you are going to have a computer search for a good controller, you want to make the search problem as easy as possible, or at least reasonable. This means that, if we think of $J$ as a function on the space of control policies, we would like to make the problem of optimizing $J$ tractable. I don’t know how to make this precise, but there are a few properties we would like $J$ to satisfy: there aren’t too many local minima, and the minima aren’t approached too steeply (meaning that there is a reasonably large neighbourhood of small values around each local minimum). If we choose an objective function that assigns a value of $\infty$ to almost everything, then we will end up spending most of our time wading through a sea of infinities without any direction (because all directions will just yield more values of $\infty$). So a very strict objective function will be very hard to optimize. Ideally, we would like a different choice of $J$ that has its minimum at the same location but that decreases gradually to that minimum, so that we can solve the problem using gradient descent or some similar method.

In practice, we might have to settle for an objective function that only is trying to minimize the same thing qualitatively, rather than in any precise manner. For example, instead of the choice of $J$ discussed above for the object moving in a straight line, we could choose

$J(X,f) = \int_{0}^{T} |q(t)|^2 dt,$

where $T$ is some arbitrary final time. In this form, we are trying to minimize the time-integral of some function of the deviation of $q$ from $0$. With a little bit of work, we can deduce that, for large enough $T$, the optimal controller is a bang-bang controller that accelerates towards $0$ at the greatest rate possible, until accelerating any more would cause the object to overshoot $q = 0$, at which point the controller should decelerate at the greatest rate possible (there are some additional cases for when the object will overshoot the origin no matter what, but this is the basic idea).
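To make the accelerate-then-brake rule concrete, here is a minimal Python simulation of it for the object on a line. This is a sketch under assumptions of my own choosing: it uses the classical minimum-time switching curve $q + \dot{q}|\dot{q}|/(2A) = 0$ as the point where "accelerating any more would cause an overshoot", a simple Euler integrator, and hypothetical values for the acceleration limit, time step, and initial state. It does not check that this rule is exactly optimal for the integral cost above.

```python
def bang_bang(q, q_dot, a_max):
    """Full acceleration towards q = 0 until braking must start, then full braking."""
    # s > 0 means we can still afford to push in the negative direction;
    # s < 0 means we can still afford to push in the positive direction.
    s = q + q_dot * abs(q_dot) / (2.0 * a_max)
    if s > 0:
        return -a_max
    if s < 0:
        return a_max
    # Exactly on the switching curve: brake against the current velocity.
    if q_dot > 0:
        return -a_max
    if q_dot < 0:
        return a_max
    return 0.0


def simulate(q, q_dot, a_max=1.0, dt=1e-3, t_final=5.0):
    """Euler-integrate q'' = u under the bang-bang policy; return the final state."""
    for _ in range(int(round(t_final / dt))):
        u = bang_bang(q, q_dot, a_max)
        q, q_dot = q + dt * q_dot, q_dot + dt * u
    return q, q_dot


q_end, q_dot_end = simulate(q=1.0, q_dot=0.0)
print("final position and velocity:", q_end, q_dot_end)
```

In discrete time the controller chatters between $+A$ and $-A$ once it gets near the origin, which is one practical reason to prefer a smoother objective.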

This brings us to my original intention in making this post, which is LQR (linear-quadratic regulator) control. In this case, we assume that $A$ and $B$ are both constant and that our cost function is of the form

$J(X,f) = \int_{0}^{\infty} X(t)^{T}QX(t) + f(X(t),t)^{T}Rf(X(t),t) dt,$

where the $T$ means transpose and $Q$ and $R$ are both positive definite matrices. In other words, we assume that our goal is to get to $x = 0$, and we penalize both our distance from $x = 0$ and the amount of torque we apply at each point in time. If we have a cost function of this form, then we can actually solve analytically for the optimal control policy $f$. The solution involves solving the Hamilton-Jacobi-Bellman equation, and I won’t go into the details, but when the smoke clears we end up with a linear feedback policy $u = -Kx$, where $K = R^{-1}B^{T}P$, and $P$ is given by the solution to the algebraic Riccati equation

$A^TP+PA-PBR^{-1}B^TP+Q=0.$

What’s even better is that MATLAB has a built-in function called lqr that will set up and solve the Riccati equation automatically.
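For readers without MATLAB, the same computation can be sketched in Python. The example below is an illustration, not the code used in the lab: it assumes SciPy is available, uses `scipy.linalg.solve_continuous_are` to solve the Riccati equation, and picks $Q = I$ and $R = 1$ arbitrarily for the frictionless object on a line.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Object on a frictionless line: state x = [q, q_dot], input u = acceleration.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])

# Cost matrices chosen arbitrarily for this sketch.
Q = np.eye(2)
R = np.array([[1.0]])

# Solve A^T P + P A - P B R^{-1} B^T P + Q = 0 for P.
P = solve_continuous_are(A, B, Q, R)

# Feedback gain K = R^{-1} B^T P, giving the policy u = -K x.
K = np.linalg.solve(R, B.T @ P)

# Sanity checks: the Riccati residual should vanish and A - B K should be stable.
residual = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
closed_loop_eigs = np.linalg.eigvals(A - B @ K)

print("K =", K)
print("closed-loop eigenvalues:", closed_loop_eigs)
```

For this particular choice the Riccati equation can also be solved by hand, giving $K = [1, \sqrt{3}]$, which is a useful check on the numerical answer.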

You might have noticed that we had to make the assumption that both $A$ and $B$ were constant, which is a fairly strong assumption, as it implies that we have an LTI (linear time-invariant) system. So what is LQR control actually good for? The answer is stabilization. If we want to design a controller that will stabilize a system about a point, we can shift coordinates so that the point is at the origin, then take a linear approximation about the origin. As long as we have a moderately accurate linear model for the system about that point, the LQR controller will successfully stabilize the system to that point within some basin of attraction. More technically, the LQR controller will make the system locally asymptotically stable, and the cost function $J$ for the linear system will be a valid local Lyapunov function.
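Here is a small sketch of that recipe on a toy problem, again assuming SciPy. The system is a hypothetical torque-driven pendulum with unit mass, length, and gravity (not the cart-pole), written in the shifted coordinate $\phi$ measured from the upright, so that $\ddot{\phi} = \sin\phi + u$. We linearize about $\phi = 0$, compute the LQR gain from the linear model, and then apply it to the nonlinear dynamics.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Linearization of phi'' = sin(phi) + u about the upright (sin(phi) ~ phi).
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)              # arbitrary cost weights for this sketch
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)


def simulate(phi0, dt=1e-3, t_final=10.0):
    """Euler-integrate the *nonlinear* pendulum under the linear policy u = -K x."""
    state = np.array([phi0, 0.0])
    for _ in range(int(round(t_final / dt))):
        u = float(-(K @ state)[0])
        phi, phi_dot = state
        state = state + dt * np.array([phi_dot, np.sin(phi) + u])
    return state


final_state = simulate(phi0=0.5)   # start 0.5 rad away from the upright
print("K =", K)
print("state after 10 s:", final_state)
```

The gain is designed from the linear model only, yet it still brings the nonlinear pendulum back to the upright from this starting angle.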

Really, the best reason to make use of LQR controllers is that they are a solution to stabilization problems that works out of the box. Many controllers that work in theory will actually require a ton of tuning in practice; this isn’t the case for an LQR controller. As long as you can identify a linear system about the desired stabilization point, even if your identification isn’t perfect, you will end up with a pretty good controller.

I was thinking of also going into techniques for linear system identification, but I think I’ll save that for a future post. The short answer is that you find a least-squares fit of the data you collect. I’ll also go over how this all applies to the underwater cart-pole in a future post.

Robotics

2010-06-18T00:00:00-07:00

This summer I am working in the Robotics Locomotion group at CSAIL (MIT’s Computer Science and Artificial Intelligence Laboratory). I’ve decided to start a blog to exposit on the ideas involved. This ranges from big theoretical ideas (like general system identification techniques) to problem-specific ideas (specific learning strategies for the system we’re interested in) to useful information on using computational tools (how to make MATLAB’s ode45 do what you want it to).

To start with, I’m going to describe the problem that I’m working on, together with John (a grad student in mechanical engineering).

Last spring, I took 6.832 (Underactuated Robotics) at MIT. In that class, we learned multiple incredibly powerful techniques for nonlinear control. After taking it, I was more or less convinced that we could solve, at least off-line, pretty much any control problem once it was posed properly. After coming to the Locomotion group, I realized that this wasn’t quite right. What is actually true is that we can solve any control problem where we have a good model and a reasonable objective function (we can also run into problems in high dimensions, but even there you can make progress if the objective function is nice enough).

So, we can (almost) solve any problem once we have a good model. That means the clear next thing to do is to come up with really good modelling techniques. Again, this is sort of true. There are actually three steps to constructing a good controller: experimental design, system identification, and a control policy.

System identification is the process of building a good model for your system given physical data from it. But to build a good model, you need good data. That’s where experimental design comes in. Many quick and dirty ways of collecting data (like measuring the response to a sinusoidal input) introduce flaws into the model (which cannot be fixed except by collecting more data). I will explain these issues in more detail in a later post. For simple systems, you can get away with still-quick but slightly-less-dirty methods (such as a chirp input), but for more general systems you need better techniques. Ian (a research scientist in our lab) has a nice paper on this topic that involves semidefinite optimization.

Designing a control policy we have already discussed. It is the process of, once we have collected data and built a model, designing an algorithm that will guide our system to its desired end state.

So, we have three tasks — experimental design, system identification, and control policy. If we can do the first two well, then we already know how to do the third. So one solution is to do a really good job on the experimental design and system identification so that we can use our sophisticated control techniques. This is what Michael and Zack are working on with a walking robot. Zack has spent months building the robot in such a way that it will obey all of our idealized models and behave nicely enough that we can run LQR-trees on it.

Another solution is to give up on having a perfect model and have a system identification algorithm that, instead of returning a single model, returns an entire space of models (for example, by giving an uncertainty on each parameter). Then, as long as we can build a controller that works for every system in this space, we’ll be good to go. This can be done by using techniques from robust control.

A final idea is to give up on models entirely and try to build a controller that relies only on the qualitative behaviour of the system in question (for example, by using model-free learning techniques). This is what I am working on. More specifically, I’m working on control in fluids with reasonably high Reynolds number. Unless you can solve the Navier-Stokes equations, you can’t hope to get a model for this system, so you’ll have to do something more clever.

The first system we’re working with is the underwater cart-pole system. This involves a pendulum attached to a cart. The pendulum itself is not actuated (meaning there is no motor to swing it up). Instead, the only way to control the pendulum is indirectly, by moving the cart around (the cart is constrained to move in a line). The fluid dynamics enter when we put the pendulum in a water tunnel and replace the arm of the pendulum with a water foil.

When the pendulum is in a constant-velocity stream, the system becomes a cart and pendulum with non-linear damping. However, when we add objects to the stream, the objects shed vortices and the dynamics become too complicated to model with an ordinary differential equation. Instead, we need to simulate the solution to a partial differential equation, which is significantly more difficult computationally.

Our first goal is to design a controller that will stabilize the pendulum at the top in the case of a constant stream (we have already done this — more about that in a later post). Our next goal is to design a controller to swing the pendulum up to the top, again in a constant stream (this is what I hope to finish tomorrow — again, more details in a later post). Once we have these finished, the more interesting work begins — to accomplish the same tasks in the presence of vortices. If we were to use existing ideas, we would design a robust version of the controller for a constant stream, and treat the vortices as disturbances. And this will probably be the first thing we do, so that we have a standard of comparison. But there are many examples in nature of animals using vortices to their advantage. So our ultimate goal is to do the same in this simple system. Since vortices represent lots of extra energy in the water, our hope is to actually pull energy out of the vortex to aid in the swing-up task, thus actually using less energy than would be needed without vortices (if this sounds crazy, consider that dead trout can swim upstream using vortices).

Hopefully this gives you a good overview of my project. This is my first attempt at maintaining a research blog, so if you have any comments to help improve the exposition, please give them to me. Also, if you’d like me to elaborate further on anything, let me know. I’ll hopefully have a post or two going into more specific details this weekend.
