Generalizing Across Categories
Published:
Humans are very good at correctly generalizing rules across categories (at least, compared to computers). In this post I will examine mechanisms that would allow us to do this in a reasonably rigorous manner. To this end I will present a probabilistic model such that conditional inference on that model leads to generalization across a category.
There are three questions along these lines that I hope to answer:
- How does one generalize rules across categories?
- How does one determine which rules should generalize across which categories?
- How does one determine when to group objects into a category in the first place?
I suspect that the mechanisms for each of these is rather complex, but I am reasonably confident that the methods I present make up at least part of the actual answer. A good exercise is to come up with examples where these methods fail.
Generalizing Across Categories
For simplicity I’m going to consider just some sort of binary rule, such as the existence of an attribute. So as an example, let’s suppose that we see a bunch of ducks, which look to varying degrees to be mallards. In addition to this, we notice that some of the ducks have three toes, and some have four toes. In this case the category is “mallards”, and the attribute is “has three toes”.
A category is going to be represented as a probabilistic relation for each potential member, of the form $p(x \in C)$, where $C$ is the category and $x$ is the object in question. This essentially indicates the degree to which the object $x$ belongs to the category $C$. For instance, if the category is “birds”, then pigeons, sparrows, and eagles all fit into the category very well, so we assign a high probability (close to 1) to pigeons, sparrows, and eagles belonging in the category “birds”. On the other hand, flamingos, penguins, ostriches, and pterodactyls, while still being bird-like, don’t necessarily fit into our archetypal notion of what it means to be a bird. So they get lower probabilities (I’d say around 0.8 for the first 3 and around 0.1 for pterodactyls, but that is all pretty subjective). Finally, dogs and cats get a near-zero probability of being birds, and a table, the plus operator, or Beethoven’s Ninth Symphony would get even-closer-to-zero probabilities of being a bird.
In the example with ducks, the probability of being a mallard will probably be based on how many observed characteristics the duck in question has in common with mallards.
For now we will think of a rule or an attribute as an observable binary property about various objects. Let’s let $P$ be the set of all things that have this property. Now we assume that membership in $P$ has some base probability of occurrence $\theta$, together with some probability of occurrence within $C$ of $\theta_C$. We will assume that there are objects $x_i$, $y_j$, and $z_k$, such that the $x_i$ were all observed to lie in $P$, the $y_j$ were all observed to not lie in $P$, and membership in $P$ has been unobserved for the $z_k$.
Furthermore, $x_i$ has probability $p_i$ of belonging to $C$, $y_j$ has probability $q_j$ of belonging to $C$, and $z_k$ has probability $r_k$ of belonging to $C$.
Again, going back to the ducks example, the $x_i$ are the ducks observed to have three toes, the $y_j$ are the ducks observed to have four toes, and the $z_k$ are the ducks whose number of toes we have not yet observed. The $p_i$, $q_j$, and $r_k$ are the aforementioned probabilities that each of these ducks is a mallard.
So to summarize, we have three classes of objects — those which certainly lie in P, those which certainly don’t lie in P, and those for which we have no information about P. For each element in each of these classes, we have a measure of the extent to which it belongs in C.
Given this setup, how do we actually go about specifying a model explicitly? Basically, we say that if something lies in $C$ then it has property $P$ with probability $\theta_C$, and otherwise it has it with probability $\theta$. Thus we end up with
$p(\theta, \theta_C \ \mid \ p_i, q_j) \propto p(\theta)p(\theta_C) \left(\prod_{i} p_i\theta_C+(1-p_i)\theta\right)\left(\prod_{j} q_j(1-\theta_C)+(1-q_j)(1-\theta)\right)$
What this ends up meaning is that $\theta$ will shift to accommodate the objects that (most likely) don’t lie in $C$, and $\theta_C$ will shift to accommodate the objects that (most likely) do lie in $C$. At the same time, if for instance most elements in $C$ also lie in $P$, then the elements that don’t lie in $P$ will have a relatively lower posterior probability of lying in $C$ (compared to its prior probability). So this model not only has the advantage of generalizing across all of $C$ based on the given observations, it also has the advantage of re-evaluating whether an object is likely to lie in a category based on whether it shares attributes with the other objects in that category.
Let’s go back to the mallards example one more time. Suppose that after we make all of our observations, we notice that most of the ducks that we think are mallards also have three toes. Then $\theta_C$ in this case (the probability that a mallard has three toes) will be close to 1. Furthermore, any duck that we observe to have four toes, we will think much less likely to be a mallard, even if it otherwise looks similar (although it wouldn’t be impossible; it could be a genetic mutant, for instance). At the same time, if non-mallards seem to have either three or four toes with roughly equal frequency, then $\theta$ will be close to 0.5.
There is one thing that I am dissatisfied with in the above model, though. As it stands, $\theta$ measures the probability of an object not lying in $C$ to have property $P$, rather than the probability of a generic object to have property $P$. This is mainly a problem because, later on, I would like to be able to talk about an object lying in multiple categories, and I don’t have a good way of doing that yet.
An important thing to realize here is that satisfying a rule or having an attribute is just another way of indicating membership in a set. So we can think of both the category and the attribute as potential sets that an object could lie in; as before, we’ll call the category $C$ and we will call the set of objects having a given attribute $A$. Then $\theta_C$ is something like $p(x \in A \ \mid \ x \in C)$, whereas $\theta$ (should be) $p(x \in A)$, although as I’ve set it up now it’s more like $p(x \in A \ \mid \ x \not\in C)$.
As a final note, this same setup can be applied to the case when there are multiple attributes under consideration. We can also apply it to the case when the objects are themselves categories, and so instead of having strict observations about each attribute, we have some sort of probability that the object will possess an attribute. In this latter case we just treat these probabilities as uncertain observations, as discussed previously.
When To Generalize
Another important question is which rules / attributes we should expect to generalize across a category, and which we should not. For instance, if we would expect “number of toes” to generalize across all animals in a given species, but not “age”.
I think that for binary attributes we will always have to generalize, although our generalization could just be “occurs at the base rate in the population”. However, for more complicated attributes (for instance, age, which has an entire continuum of possible values), we can think about whether to generalize based on whether our posterior distribution ends up very tightly concentrated or very spread out. Even if it ends up fairly spread out, there is still some degree of generalization — for instance, we expect that most insects have fairly short lifespans compared to mammals. You could say that this is actually a scientific fact (since we can measure lifespan), but even if you ran into a completely new species of insect, you would expect it to have a relatively short lifespan.
On the other hand, if we went to a college town, we might quickly infer that everyone there was in their 20s; if we went to a bridge club, we would quickly infer that everyone there was past retirement age.
What I would say is that we can always generalize across a category, but sometimes the posterior distribution we end up with is very spread out, and so our generalization isn’t particularly strong. In some cases, the posterior distribution might not even be usefully different from the prior distribution; in this case, from a computational standpoint it doesn’t make sense to keep track of it, and so it feels like an attribute shouldn’t generalize across a category, while what we really mean is that we don’t bother to keep track of that generalization.
An important thing to keep in mind on this topic is that everything that we do when we construct models is a computational heuristic. If we didn’t care about computational complexity and only wanted to arrive at an answer, we would use Solomonoff induction, or some computable approximation to it. So whenever I talk about something in probabilistic modeling, I’m partially thinking about whether the method is computationally feasible at all, and what sort of heuristics might be necessary to actually implement something in practice.
To summarize:
- attributes always generalize across a category
- but the generalization might be so weak as to be indistinguishable from the prior distribution
- it would be interesting to figure out how we decide which generalizations to keep track of, and which not to
Forming Categories
My final topic of consideration is when to actually form a category. Before going into this, I’d like to share some intuition with you. This is the intuition of explanatory “cheapness” or “expensiveness”. The idea is that events that have low probability under a given hypothesis are “expensive”, and events that have relatively high probability are “cheap”. When a fixed event is expensive under a model, that is evidence against that model; when it is cheap under a model, that is evidence for the model.
The reason that forming clusters can be cheaper is that it allows us to explain many related events simultaneously, which is cheaper than explaining them all separately. Let’s take an extreme example — suppose that 70% of all ducks have long beaks, and 70% have three toes (the remaining 30% in both cases have short beaks and four toes). Then the “cheapest” explanation would be to assign both long beaks and three toes a probability of 0.7, and the probability of our observations would be (supposing there were 100 ducks) $0.7^{140}0.3^{60} \approx 10^{-53}$. However, suppose that we also observe that the ducks with long beaks are exactly the ducks with three toes. Then we can instead postulate that there are two categories, “(long,3)” and “(short,4)”, and that the probability of ending up in the first category is 0.7 while the probability of ending up in the second category is 0.3. This is a much cheaper explanation, as the probability of our observations in this scenario becomes $0.7^{70}0.3^{30} \approx 10^{-27}$. We have to balance this slightly with the fact that an explanation that involves creating an extra category is more expensive (because it has a lower prior probability / higher complexity) than one that doesn’t, but the extra category can’t possibly be $10^{26}$ times more expensive, so we should still always favor it.
This also demonstrates that it will become progressively cheaper to form categories as we get large amounts of similar evidence, which corresponds with our intuitive notion of categories as similarity clusters.
So, now that I’ve actually given you this intuition, how do we actually go about forming categories? Each time we form a category, we will have some base probability of membership in that category, together with a probability of each member of that category possessing each attribute under consideration. If we simply pick prior distributions for each of these parameters, then they will naturally adopt posterior distributions based on the observed data, although these distributions might have multiple high-probability regions if multiple categories can be formed from the data. At the same time, each object will naturally end up with some probability of being in each category, based on how well that category explains its characteristics (as well as the base probability that objects should end up in that category). Putting these together, we see that clusters will naturally form to accommodate the data.
As an example of how we would have multiple high-probability regions, suppose that, as before, there are long-billed, three-toed ducks and short-billed, four-toed ducks. But we also notice that ducks with orange beaks have white feathers and ducks with yellow beaks have brown feathers. If we form a single category, then there are two good candidates, so that category is likely to be either the (long,3) category or the (orange,white) category (or their complements). If we form two categories, each category in isolation is still likely to be either the (long,3) or (orange,white) category, although if the first category is (long,3), then the second is very likely to be (orange,white), and vice versa. In other words, it would be silly to make both categories (long,3), or both categories (orange,white).
The only thing that is left to do is to specify some prior probability of forming a category. While there are various ways to do this, the most commonly used way is the Indian Buffet Process. I won’t explain it in detail here, but I might explain it later.
Future Considerations
There are still some unresolved questions here. First of all, in reality something has the potential to be in many, many categories, and it is not entirely clear how to resolve such issues in the above framework. Secondly, keeping track of all of our observations about a given category can be quite difficult computationally (updating in the above scenarios requires performing computations on high-degree polynomials), so we need efficient algorithms for dealing with all of the data we get.
I’m not sure yet how to deal with the first issue, although I’ll be thinking about it in the coming weeks. To deal with the second issue, I intend to use an approach based on streaming algorithms, which will hopefully make up a good final project for a class I’m taking this semester (Indyk and Rubinfeld’s class on Sublinear Algorithms, if you know about MIT course offerings).