## Wednesday, August 15, 2007

Netflix SVD Derivation

Adam Wagman provided this nice elaboration in the Netflix Prize forums of the derivation of my incremental SVD method:

Here's a basic derivation. Let R[i][j] be the known rating by user i for item j, and let p[i][j] be the predicted rating for that user and item. Let k index the singular vectors, over [0, N). Let u[k][i] be the element of the kth singular user vector for the ith user, and let v[k][j] be the element of the kth singular item vector for the jth item. The standard SVD formulation then says that the prediction is computed by:

p[i][j] = sum over k: u[k][i] * v[k][j]

The error in the prediction is obviously just:
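As a concrete sketch of that formula (with hypothetical sizes — N = 2 singular vectors and 3 users/items — chosen purely for illustration), the prediction is just a dot product over the k index:

```c
#include <stddef.h>

/* Illustrative sizes, not from the actual Netflix data. */
#define N_FEATURES 2   /* number of singular vectors k */
#define N_USERS    3
#define N_ITEMS    3

/* p[i][j] = sum over k: u[k][i] * v[k][j] */
double predict(const double u[N_FEATURES][N_USERS],
               const double v[N_FEATURES][N_ITEMS],
               size_t i, size_t j) {
    double p = 0.0;
    for (size_t k = 0; k < N_FEATURES; k++)
        p += u[k][i] * v[k][j];
    return p;
}
```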

E = R[i][j] - p[i][j]

For gradient descent, you want to take the partial derivative of the squared error with respect to each parameter, i.e. with respect to each u[k][i] and each v[k][j].

So let's compute this for one particular rating, by user i = I for item j = J, for one singular vector k = K, starting with the user parameter u.

d(E^2)/du[K][I] = 2 * E * dE/du[K][I]

Since E = R - p, and since R is a constant, dE/du[K][I] is just -dp/du[K][I]. Now p is just a sum of N terms (one for each singular vector), and only one of them is a function of u[K][I], namely the term u[K][I] * v[K][J]. Its derivative with respect to u[K][I] is clearly just v[K][J], and the derivatives of all the other terms are zero.

Putting this all together, for the single rating by user i = I for item j = J, considering just one singular vector k = K:

d(E^2)/du[K][I] = 2 * E * (-v[K][J]) = -2 * (R[I][J] - p[I][J]) * v[K][J]

This is for one rating; the full derivative for u[K][I] is then just the sum over all the ratings by user I. If you follow the same procedure to take the partial derivative with respect to v[K][J], you'll get:

d(E^2)/dv[K][J] = 2 * E * (-u[K][I]) = -2 * (R[I][J] - p[I][J]) * u[K][I]

When using the simple backpropagation algorithm for gradient descent, which is the technique Simon described, one uses an arbitrary parameter called the learning rate as a multiplier on the gradient, and takes that as the step to add to the parameter. So you get:
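A quick way to sanity-check these derivatives is a numerical gradient check. This sketch uses made-up numbers: it treats one u parameter as the variable x, folds the other singular vectors' contributions into `rest`, and compares the analytic formula above against a central difference:

```c
#include <math.h>

/* Squared error for one rating as a function of one user parameter x
   (standing in for u[K][I]); `rest` is the sum of the other k terms,
   and `v` is the item parameter v[K][J]. */
static double sq_err(double R, double rest, double v, double x) {
    double p = rest + x * v;   /* p[I][J] = (other terms) + u * v */
    double E = R - p;
    return E * E;
}

/* Analytic gradient from the text: d(E^2)/du = -2 * (R - p) * v */
static double grad_analytic(double R, double rest, double v, double x) {
    double p = rest + x * v;
    return -2.0 * (R - p) * v;
}

/* Central-difference numerical gradient for comparison. */
static double grad_numeric(double R, double rest, double v, double x) {
    const double h = 1e-6;
    return (sq_err(R, rest, v, x + h) - sq_err(R, rest, v, x - h)) / (2.0 * h);
}
```

Since E^2 is quadratic in x, the central difference matches the analytic form essentially to rounding error.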

newU[K][I] = oldU[K][I] - lrate * d(E^2)/du[K][I] = oldU[K][I] + lrate * 2 * E * oldV[K][J]

And since lrate is just an arbitrary constant (although its value is important), Simon absorbed the factor of 2 into it and wrote:

newU[K][I] = oldU[K][I] + lrate * E * oldV[K][J]

or, in Simon's notation:

```c
real err = lrate * (rating - predictRating(movie, user));
userValue[user] += err * movieValue[movie];
movieValue[movie] += err * userValue[user];
```

Here he is using "movie" where I have used "j" and "user" where I have used "i", and he is implicitly considering just one singular vector k, i.e. he has abbreviated userValue[k][user] as just userValue[user] and movieValue[k][movie] as movieValue[movie].
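To see the rule in action, here is a minimal single-feature training loop in C. The toy rating set, the 0.1 starting values, the learning rate, and the epoch count are all illustrative choices of mine, not Simon's actual settings; note that, as in Simon's snippet, the movieValue update uses the already-updated userValue:

```c
/* Minimal sketch of the incremental update for a single feature k. */
enum { NUM_USERS = 2, NUM_MOVIES = 2, NUM_RATINGS = 3 };

static const int    ru[NUM_RATINGS] = { 0, 0, 1 };       /* user of each rating  */
static const int    rm[NUM_RATINGS] = { 0, 1, 0 };       /* movie of each rating */
static const double rv[NUM_RATINGS] = { 4.0, 2.0, 4.0 }; /* rating values        */

double userValue[NUM_USERS];
double movieValue[NUM_MOVIES];

static double predictRating(int movie, int user) {
    return userValue[user] * movieValue[movie];   /* single feature only */
}

/* Train for `epochs` passes over the ratings; returns the final
   sum of squared errors. */
double train(double lrate, int epochs) {
    for (int u = 0; u < NUM_USERS; u++)  userValue[u]  = 0.1;
    for (int m = 0; m < NUM_MOVIES; m++) movieValue[m] = 0.1;
    double sse = 0.0;
    for (int e = 0; e < epochs; e++) {
        sse = 0.0;
        for (int r = 0; r < NUM_RATINGS; r++) {
            int movie = rm[r], user = ru[r];
            double err = lrate * (rv[r] - predictRating(movie, user));
            userValue[user]   += err * movieValue[movie];
            movieValue[movie] += err * userValue[user];
            double resid = rv[r] - predictRating(movie, user);
            sse += resid * resid;
        }
    }
    return sse;
}
```

On this tiny example a rank-1 factorization can fit the ratings exactly, so the squared error drives toward zero as the loop runs.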

Hope this helps.
