
## Wednesday, June 03, 2015

#### Gradient of Auto-Normalized Probability Distributions

Beware, math ahead.

Consider a model which is an auto-normalized probability distribution over the states of a discrete variable $D$:

$$P(D_i) \equiv \frac{F_i}{\sum_m F_m} \tag{1}$$

where each $F_m$ is a scalar-valued function of some (implicit) parameters.

If $\partial_i$ is the partial derivative with respect to some variable that only affects $F_i$, then:

$$\partial_i P(D_k) = \frac{\partial_i F_i}{\sum_m F_m}\left(\delta_{ik} - P(D_k)\right) \tag{2}$$

where $\partial_i F_k = 0$ when $i \neq k$, and $\delta_{ik}$ is 1 when $i = k$ and 0 otherwise.

Separately, if we have some true data $\bar D$ (a large set of observed states of $D$, like throws of a die), expressed as a normalized distribution $\bar P(\bar D)$, then we can compute our model's probability of generating that observed data (per-sample, so independent of the size of $\bar D$) as:

$$P(\bar D) = \prod_k P(D_k)^{\bar P(D_k)} \tag{3}$$

The log of which is the (negated) cross entropy:

$$E \equiv \log P(\bar D) = \sum_k \bar P(D_k) \log P(D_k) \tag{4}$$

Taking the partial derivative and pulling in $\partial_i P(D_k)$ from (2):

$$\partial_i E = \sum_k \frac{\bar P(D_k)}{P(D_k)}\,\partial_i P(D_k) = \frac{\partial_i F_i}{\sum_m F_m}\left(\frac{\bar P(D_i)}{P(D_i)} - 1\right)$$

In sum:

$$\partial_i E = \frac{\partial_i F_i}{F_i}\left(\bar P(D_i) - P(D_i)\right) \tag{5}$$
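As a sanity check, equation (5) can be verified numerically against a finite-difference gradient. A minimal sketch, assuming (purely for illustration) the parameterization $F_i = e^{w_i}$, for which $\partial_i F_i / F_i = 1$ and (5) reduces to $\partial_i E = \bar P(D_i) - P(D_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameterization for illustration: F_i = exp(w_i),
# so dF_i/dw_i = F_i and equation (5) reduces to dE/dw_i = Pbar_i - P_i.
w = rng.normal(size=4)          # one parameter per state of D
pbar = rng.random(4)
pbar /= pbar.sum()              # observed distribution, normalized

def E(w):
    F = np.exp(w)
    P = F / F.sum()                 # equation (1)
    return np.dot(pbar, np.log(P))  # equation (4)

F = np.exp(w)
P = F / F.sum()
analytic = pbar - P             # equation (5) with dF_i/F_i = 1

# Central finite differences, one component at a time.
eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(4)[i]) - E(w - eps * np.eye(4)[i])) / (2 * eps)
                    for i in range(4)])

assert np.allclose(analytic, numeric, atol=1e-6)
```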

So, what's it mean and why is it interesting?

Equation (1) is a fairly common scenario when you have some ability to model the relative likelihood or frequency of events. The denominator normalizes those relative votes into an actionable probability distribution.

Equation (3) in most cases is the definitive measure of the quality of a model, and is thus usually the quantity you want to maximize.

Equation (5) puts them together, with the interesting result that the optimizing gradient is proportional to the residual error we get by subtracting our predicted distribution from the observed one. Note that because these are probabilities, the components of that residual are nicely bounded between -1 and 1, so it has the makings of a very well-behaved gradient. It is intuitively appealing because it is zero when our prediction matches observation. And it's just surprisingly clean and simple.
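Because the ascent direction is just that residual, a few lines of gradient ascent will pull the predicted distribution onto the observed one. A toy sketch, again assuming the illustrative parameterization $F_i = e^{w_i}$:

```python
import numpy as np

rng = np.random.default_rng(2)
pbar = np.array([0.1, 0.2, 0.3, 0.4])   # target (observed) distribution
w = rng.normal(size=4)                  # assumed parameterization: F_i = exp(w_i)

for _ in range(500):
    F = np.exp(w)
    P = F / F.sum()                     # equation (1)
    w += 0.5 * (pbar - P)               # ascend E along the residual, equation (5)

# The prediction converges onto the data.
assert np.allclose(P, pbar, atol=1e-3)
```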

Where it goes from here depends on the nature of the underlying F functions.

One case to consider is where $F$ is a product of distributions, where each $G_j$ is a non-normalized pseudo probability distribution over $i$:

$$F_i = \prod_j G_{j,i} \tag{6}$$

Then, for $\partial$ taken with respect to a parameter that only affects $G_{j,i}$:

$$\partial F_i = \frac{F_i}{G_{j,i}}\,\partial G_{j,i} \tag{7}$$

Applying (5), note that $F_i$ cancels out, so curiously we end up with essentially (5) again:

$$\partial E = \frac{\partial G_{j,i}}{G_{j,i}}\left(\bar P(D_i) - P(D_i)\right) \tag{8}$$

The interesting point here is that even though many $G_j$'s are being combined into the final predicted distribution, the "explaining away" competition amongst them is coordinated entirely via the consolidated $P(D_i)$ predictions (as expressed in the error residual). That is, other than the common error residual, the gradient with respect to a parameter of some $G_{j,i}$ is local.
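This locality is easy to confirm numerically. A sketch assuming, for illustration, three experts over four states with $G_{j,i} = e^{w_{j,i}}$, so that $\partial G_{j,i}/G_{j,i} = 1$ and (8) predicts the very same residual gradient for every expert:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: three expert pseudo-distributions G_j over four states,
# each parameterized G_{j,i} = exp(w[j,i]), so dG/dw = G and equation (8)
# reduces to dE/dw[j,i] = Pbar_i - P_i for every expert j.
W = rng.normal(size=(3, 4))
pbar = rng.random(4)
pbar /= pbar.sum()

def E(W):
    F = np.exp(W).prod(axis=0)      # F_i = prod_j G_{j,i}, equation (6)
    P = F / F.sum()                 # equation (1)
    return np.dot(pbar, np.log(P))  # equation (4)

F = np.exp(W).prod(axis=0)
P = F / F.sum()

# Every expert sees the same residual gradient; the only coupling between
# experts is through the consolidated prediction P itself.
eps = 1e-6
for j in range(3):
    for i in range(4):
        dW = np.zeros_like(W)
        dW[j, i] = eps
        numeric = (E(W + dW) - E(W - dW)) / (2 * eps)
        assert np.isclose(numeric, pbar[i] - P[i], atol=1e-6)
```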

Note this is analogous to the process that causes singular vectors to become orthogonal in the incremental SVD approach (when training them all at once as opposed to one by one). There we are removing the net projection of the current vectors from the input before training. Here we are removing the net prediction. This reverse inhibition is a recurring theme in various learning algorithms, including some brain-inspired ones.

[Source: personal notebook, Jan 2005]


Simon Funk / simonfunk@gmail.com