[<< | Prev | Index | Next | >>]
Friday, August 17, 2007
After looking over the Proceedings of KDD Cup and Workshop 2007 I am reminded of some things I left dangling where I left off in December.
This is bad, because I've got other things I should be working on. :)
But, while it's in my head, here it is:
[As a preliminary I should mention that the performance boost I got a day or two after I posted my December entry was due to a run finishing that was training all features simultaneously. While this shouldn't make any difference in the linear case due to orthogonality of the features, once you introduce any clipping or especially regularization it becomes advantageous. This is relevant in the discussion that follows since later I talk about using the covariance matrix of features which only makes sense when training multiple features at once.]
A detail I glossed over in my original description was on the choice of the particular regularization term. There were two obvious options: One was to penalize the magnitude of the feature vectors, which corresponds to a prior assumption that the features are drawn from normal distributions with some fixed variance, and the other is to penalize the magnitude of the feature values in proportion to how often they are used which, if one is to interpret that as a simple prior at all, assumes a prior whose variance is inversely proportional to, e.g., the number of ratings by the corresponding user; or, looked at another way, corresponds to penalizing the magnitude of the unrolled feature vectors -- where instead of having one scalar value per user per feature, for instance, we have one scalar value per observed rating per feature, and then we tie many of those values together (weight sharing) according to which users cast each rating. (Note those two schemes are identical except in how one measures the magnitude of the resulting feature vector.)
My method uses the latter of those two options, for a few reasons. First, intuitively it makes sense to penalize in proportion to usage: We assume similar usage patterns will continue, so that a user who hardly ever rates will likewise have little impact on future rmse, and thus the cost of their feature values being too large is proportionally diminished. Secondly, after trying both anyway it simply worked better. :) Lastly, it's slightly easier to implement and more well-behaved in an incremental algorithm.
The open question is: why, really, does that work better? (Or: does it? If anyone else tried both methods I would like to get independent confirmation.)
A Bayesian formulation prefers the other approach, but this results in the feature vectors for any sparsely observed users being squelched to near zero. Essentially the prior dictates that by default all feature values should be zero, and any user with few ratings simply never overcomes this prior. The usage-relative prior, then, "fixes" this by softening up the prior for sparsely observed users (and movies). But really, this "fix" is just a violation of the whole notion of priors... so why does it improve things? I have two theories about this. Please feel free to enlighten me if you have a better clue:
One is that, related to the intuitive justification given above, there is some implicit variational Bayes going on here, in that we are exploring the space of models and asking about the impact of that. Specifically: if we assume, for instance, that our movie-side model (movie features) is only approximately correct, then we can imagine that the true movie-side model, or the one that will best predict future ratings, will be some random offset away from the one we have. And we can ask: what is the effect of that offset? That effect is going to be proportional to the magnitude and prevalence of the features which scale those movie features: the user features. In other words, the larger and more common a user feature is, the more the error in the corresponding movie features will be amplified. (And the converse as well.) So, we are actually treating our model as the center of a (normal?) distribution of hypothetical models, and accounting (hand-wave, hand-wave) for the expected value of the probability of our data set over that distribution of models.
You look skeptical.
The other idea, not necessarily incompatible with the first, is that the prior itself sucks. Specifically: the prior assumes that each feature value is taken independently of the others from a normal distribution. But in truth if we look at the correlation matrix between the feature values there are quite a lot of sizable values off the diagonal. In other words, before we know anything about a given user's specific preferences, we know how their various preferences are likely to interact. (For instance, someone who loves christian themes is less apt to love gay themes, which is useful information as a prior assumption even if no single feature directly captures that particular interaction.) By taking this covariance matrix into account, we might be able to tighten our priors back down on the sparsely observed users while still leaving them room to move (in specific directions). Arguably, this fine shape of the priors cloud is far more important for the sparse cases where the priors dominate, vs. the strongly observed cases which are able to define their own shape within a radially symmetric priors cloud. Hand-wave, hand-wave.
Or both could be true at the same time. I am tempted to try to work out an incremental algorithm accounting for a covariance matrix on the priors as well as some variational aspects...
Curious to hear any thoughts.
(The paper that really got me thinking about this again was Variational Bayesian Approach to Movie Rating Prediction. One thing that bothers me is their MAP model [eqn 12] should be equivalent to my model with the worse of the two regularization options, yet they claim it works better... What am I missing? Or are they using a covarience matrix [as opposed to, e.g., using eqn 9] for their MAP model? Isn't clear.)
[UPDATE: I just found this paper (Principal Component Analysis for Large Scale Problems with Lots of Missing Values) which seems highly related. Note the form of eqns 12 and 13. This is a gradient method derived from a variational Bayes approach which results in essentially the same gradient I use except with finer control over the regularization parameters. Score one for intuition!]
[<< | Prev | Index | Next | >>]