Wednesday, June 03, 2015
Gradient of Auto-Normalized Probability Distributions
Beware, math ahead.
Consider a model which is an auto-normalized probability distribution over the states of a discrete variable D:
$$P(D_k) \equiv \frac{F_k}{\sum_m F_m} \qquad (1)$$

where each $F_m$ is a scalar-valued function of some (implicit) parameters.

If $\partial_i$ is the partial derivative with respect to some variable that only affects $F_i$, then:

$$\partial_i \log P(D_k) = \frac{\partial_i F_k}{F_k} - \frac{\partial_i F_i}{\sum_m F_m} \qquad (2)$$

where $\partial_i F_k = 0$ when $i \neq k$.
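As a quick numeric sanity check of (2), here is a minimal sketch assuming each $F_m$ depends only on its own parameter $w_m$ through the arbitrary positive choice $F_m = w_m^2$ (any positive, differentiable choice would do; the parameter names are purely illustrative):

```python
import numpy as np

w = np.array([0.7, 1.3, 2.1, 0.4])      # one parameter per state (toy setup)
F = lambda w: w**2                      # F_m(w_m): arbitrary positive choice
P = lambda w: F(w) / F(w).sum()         # Eq. (1)

i, k = 1, 2                             # differentiate w.r.t. w_i, evaluate log P(D_k)
dF_i = 2 * w[i]                         # analytic dF_i/dw_i for F = w^2

# Eq. (2): the dF_k/F_k term only survives when i == k
analytic = (dF_i / F(w)[k] if i == k else 0.0) - dF_i / F(w).sum()

# forward finite-difference check
eps = 1e-6
w_plus = w.copy()
w_plus[i] += eps
numeric = (np.log(P(w_plus)[k]) - np.log(P(w)[k])) / eps

print(analytic, numeric)                # the two should agree to ~1e-6
```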
Separately, if we have some true data $\bar{D}$ (a large set of observed states of $D$, like throws of a die), expressed as a normalized distribution $\bar{P}(\bar{D})$, then we can compute our model's probability of generating that observed data (per-sample, so independent of the size of $\bar{D}$) as:
$$P(\bar{D}) = \prod_k P(D_k)^{\bar{P}(\bar{D}_k)} \qquad (3)$$

The log of which is the (negative) cross entropy:
$$\log P(\bar{D}) = \sum_k \bar{P}(\bar{D}_k) \log P(D_k) \qquad (4)$$

Taking the partial derivative and pulling in $\partial_i \log P(D_k)$ from (2) above:
$$\begin{aligned}
\partial_i \log P(\bar{D}) &= \sum_k \bar{P}(\bar{D}_k)\, \partial_i \log P(D_k) \\
&= \sum_k \bar{P}(\bar{D}_k) \left[ \frac{\partial_i F_k}{F_k} - \frac{\partial_i F_i}{\sum_m F_m} \right] \\
&= \sum_k \bar{P}(\bar{D}_k) \frac{\partial_i F_k}{F_k} - \sum_k \bar{P}(\bar{D}_k) \frac{\partial_i F_i}{\sum_m F_m} \\
&= \bar{P}(\bar{D}_i) \frac{\partial_i F_i}{F_i} - \frac{\partial_i F_i}{\sum_m F_m} \\
&= \frac{\partial_i F_i}{F_i} \left[ \bar{P}(\bar{D}_i) - P(D_i) \right]
\end{aligned} \qquad (5)$$

In sum:
$$P(D_k) \equiv \frac{F_k}{\sum_m F_m}$$

$$P(\bar{D}) = \prod_k P(D_k)^{\bar{P}(\bar{D}_k)}$$

$$\partial_i \log P(\bar{D}) = \frac{\partial_i F_i}{F_i} \left[ \bar{P}(\bar{D}_i) - P(D_i) \right] \qquad (6)$$

So, what's it mean and why is it interesting?
Equation (1) describes a fairly common scenario, where you have some ability to model the relative likelihood or frequency of events. The denominator normalizes those relative votes into an actionable probability distribution.
Equation (3) in most cases is the definitive measure of the quality of a model, and is thus usually the quantity you want to maximize.
Equation (5) puts them together with the interesting result that the optimizing gradient is proportional to the residual error we get by subtracting our predicted distribution from the observed one. Note that because these are probabilities, the components of that residual are nicely bounded between -1 and 1, so it has the makings of a very well-behaved gradient. It is intuitively appealing because it is zero when our prediction matches observation. And it's just surprisingly clean and simple.
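Here is a small sketch verifying (5)/(6) against finite differences, again under the toy assumption of one parameter per state with $F_m = w_m^2$ (the observed distribution is just made-up numbers for illustration):

```python
import numpy as np

w = np.array([0.7, 1.3, 2.1, 0.4])            # one parameter per state (toy setup)
P_bar = np.array([0.1, 0.4, 0.3, 0.2])        # observed distribution, sums to 1

F = lambda w: w**2                            # arbitrary positive F_m(w_m)
P = lambda w: F(w) / F(w).sum()               # Eq. (1)
logL = lambda w: np.dot(P_bar, np.log(P(w)))  # Eq. (4)

dF = 2 * w                                    # dF_i/dw_i for every i
analytic = (dF / F(w)) * (P_bar - P(w))       # Eq. (5)/(6), all components at once

# forward finite-difference gradient, one parameter at a time
eps = 1e-6
numeric = np.array([
    (logL(w + eps * np.eye(len(w))[i]) - logL(w)) / eps
    for i in range(len(w))
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```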
Where it goes from here depends on the nature of the underlying F functions.
One case to consider is where the $F$ distribution is a product of distributions (where each $G_j$ is a non-normalized pseudo-probability distribution over $i$):

$$F_i = \prod_j G_{j,i} \qquad (7)$$

Then, with $\partial_{j,i}$ the partial derivative with respect to a variable that only affects $G_{j,i}$:
$$\partial_{j,i} F_i = \frac{F_i}{G_{j,i}}\, \partial_{j,i} G_{j,i} \qquad (8)$$

Applying (5), note that $F_i$ cancels out, so curiously we end up with essentially (5) again:
$$\partial_{j,i} \log P(\bar{D}) = \frac{\partial_{j,i} G_{j,i}}{G_{j,i}} \left[ \bar{P}(\bar{D}_i) - P(D_i) \right] \qquad (9)$$

The interesting point here is that even though many $G_j$s are being combined into the final predicted distribution, the "explaining away" competition amongst them is coordinated entirely via the consolidated $P(D_i)$ predictions (as expressed in the error residual). That is, other than the common error residual, the gradient with respect to a parameter of some $G_{j,i}$ is local.
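To make the locality concrete, here is a sketch of (9) assuming each $G_{j,i}$ has its own parameter with the arbitrary choice $G_{j,i} = e^{w_{j,i}}$. In that case $\partial_{j,i} G_{j,i} / G_{j,i} = 1$, so every parameter's gradient is exactly the shared residual for its state $i$, which is easy to confirm numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                   # 3 factors G_j over 4 states i (toy setup)
P_bar = np.array([0.1, 0.4, 0.3, 0.2])        # observed distribution

G = lambda W: np.exp(W)                       # G_{j,i}(w_{j,i}): arbitrary positive choice
F = lambda W: G(W).prod(axis=0)               # Eq. (7): F_i = prod_j G_{j,i}
P = lambda W: F(W) / F(W).sum()               # Eq. (1)
logL = lambda W: np.dot(P_bar, np.log(P(W)))  # Eq. (4)

# Eq. (9): with G = exp(w), dG/dw = G, so dG/G = 1 and every parameter of state i
# gets the same gradient: the shared residual P_bar_i - P_i.
analytic = np.broadcast_to(P_bar - P(W), W.shape)

# forward finite-difference check over every parameter
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(W.shape[0]):
    for i in range(W.shape[1]):
        Wp = W.copy()
        Wp[j, i] += eps
        numeric[j, i] = (logL(Wp) - logL(W)) / eps
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```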
Note this is analogous to the process that causes singular vectors to become orthogonal in the incremental SVD approach (when training them all at once as opposed to one by one). There we are removing the net projection of the current vectors from the input before training. Here we are removing the net prediction. This reverse inhibition is a recurring theme in various learning algorithms, including some brain-inspired ones.
[Source: personal notebook, Jan 2005]