
Wednesday, August 15, 2007

Netflix SVD Derivation



In the Netflix Prize forums, Adam Wagman provided this nice elaboration of the derivation of my incremental SVD method:

Here's a basic derivation. Let R[i][j] be the known rating by user i for an item j, and let p[i][j] be the predicted rating for that user and item. We'll let k represent the index of the singular vectors [0, N). Let u[k][i] be the element of the kth singular user vector for the ith user, and let v[k][j] be the element of the kth singular item vector for the jth item.

The standard SVD formulation then says that the prediction is computed by:

	p[i][j]  =  sum over k: u[k][i] * v[k][j] 
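
In code, this prediction is just the dot product of the user's feature vector with the item's. Here's a minimal C sketch, assuming u[k] and v[k] point to the kth singular user and item vectors (the function name and signature are illustrative, not Simon's actual code):

	/* Predicted rating for user i, item j: the dot product of the
	   user's and the item's feature vectors over all N singular vectors. */
	double predict(double **u, double **v, int N, int i, int j)
	{
	    double p = 0.0;
	    for (int k = 0; k < N; k++)
	        p += u[k][i] * v[k][j];
	    return p;
	}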

The error in the prediction is obviously just:

	E  =  R[i][j] - p[i][j]

For gradient descent, you want to take the (partial) derivative of the squared error with respect to each parameter, i.e. with respect to each u[k][i] and each v[k][j].

So let's compute this for one particular rating, by user i = I for item j = J, for one singular vector k = K, with respect to the user parameter u[K][I].

	d(E^2)/du[K][I]  =  2 * E * dE/du[K][I]

Since E = R - p, and since R is a constant, dE/du[K][I] is just -dp/du[K][I]. Now p is just a sum of N terms (one for each singular vector), and only one of them is a function of u[K][I], namely the term u[K][I] * v[K][J]. Its derivative with respect to u[K][I] is clearly just v[K][J], and the derivatives of all the other terms are zero.

Putting this all together, for the single rating by user i = I for item j = J, considering just one singular vector k = K:

	d(E^2)/du[K][I]  =  2 * E * (-v[K][J])  =  -2 * (R[I][J] - p[I][J]) * v[K][J]

This is for one rating; the full derivative for u[K][I] is then just the sum of this over all the ratings by user I. If you follow the same procedure to take the partial derivative with respect to v[K][J] (whose full derivative likewise sums over all the ratings of item J), you'll get:

	d(E^2)/dv[K][J]  =  2 * E * (-u[K][I])  =  -2 * (R[I][J] - p[I][J]) * u[K][I]
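
Translated directly into code, the per-rating gradients for one singular vector K look like this; a sketch assuming the predict() function from above, with R_IJ standing in for the known rating R[I][J]:

	/* Gradients of the squared error for one rating (user I, item J)
	   with respect to u[K][I] and v[K][J], per the equations above. */
	void gradOneRating(double **u, double **v, double R_IJ,
	                   int N, int I, int J, int K,
	                   double *grad_u, double *grad_v)
	{
	    double E = R_IJ - predict(u, v, N, I, J);   /* prediction error */
	    *grad_u = -2.0 * E * v[K][J];               /* d(E^2)/du[K][I] */
	    *grad_v = -2.0 * E * u[K][I];               /* d(E^2)/dv[K][J] */
	}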

When using simple gradient descent by backpropagation, which is the technique Simon described, one scales the gradient by an arbitrary parameter called the learning rate and steps each parameter in the opposite direction. So you get:

	newU[K][I]  =  oldU[K][I] - lrate * d(E^2)/du[K][I]  =  oldU[K][I] + lrate * 2 * E * oldV[K][J]

And since lrate is just an arbitrary constant (although its value is important), Simon just absorbed the factor of 2 into it and wrote:

	newU[K][I]  =  oldU[K][I] + lrate * E * oldV[K][J]

or, in Simon's notation:

	real err  =  lrate * (rating - predictRating(movie, user));
	userValue[user]  +=  err * movieValue[movie];
	movieValue[movie]  +=  err * userValue[user];

Here he is using "movie" where I have used "j" and "user" where I have used "i", and he is implicitly considering just one singular vector k; i.e., he has abbreviated userValue[k][user] to just userValue[user], and movieValue[k][movie] to movieValue[movie].
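
Putting it all together, here's a sketch of one training sweep over the ratings for a single feature k, in the spirit of Simon's snippet. The function and array names and the training-triple layout are my own illustrative assumptions; predictRating() is assumed to return the current full prediction, and uv caches the pre-update user value:

	extern double predictRating(int movie, int user);

	/* One incremental-SVD sweep over all ratings, updating feature k. */
	void trainFeature(int k, int numRatings,
	                  const int *user, const int *movie, const double *rating,
	                  double **userValue, double **movieValue, double lrate)
	{
	    for (int r = 0; r < numRatings; r++) {
	        int u = user[r], m = movie[r];
	        /* err folds in the learning rate, as in Simon's snippet */
	        double err = lrate * (rating[r] - predictRating(m, u));
	        double uv  = userValue[k][u];    /* cache the pre-update value */
	        userValue[k][u]  += err * movieValue[k][m];
	        movieValue[k][m] += err * uv;
	    }
	}

Caching uv makes both updates use the pre-update factors, which is what the simultaneous gradient step calls for; Simon's two-line snippet, taken literally, updates userValue first and then reuses the new value in the movieValue update, a small departure from the exact gradient.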

Hope this helps.





Simon Funk / simonfunk@gmail.com