
Wednesday, June 03, 2015

Gradient of Auto-Normalized Probability Distributions



Beware, math ahead.

Consider a model which is an auto-normalized probability distribution over the states of a discrete variable `D`:

` "P"(D_k) -= F_k/(sum_m F_m)` (1)

where each `F_m` is a scalar valued function of some (implicit) parameters.
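
To make (1) concrete, here's a minimal Python sketch (the scores in `F` are just made-up numbers; in practice each would be computed from the model's parameters):

```python
import numpy as np

# Equation (1): positive scores F_k, normalized into a distribution.
F = np.array([2.0, 1.0, 0.5, 1.5])  # hypothetical relative scores
P = F / F.sum()                     # P(D_k) = F_k / sum_m F_m
print(P, P.sum())                   # a proper distribution: sums to 1
```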

If `del_i` is the partial derivative with respect to some parameter that affects only `F_i`, then:

` del_i log "P"(D_k) = (del_i F_k)/(F_k) - (del_i F_i)/(sum_m F_m)` (2)

where `del_i F_k = 0` when `i != k`.
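
A quick numerical sanity check of (2), under the simplifying assumption `F_m = e^(theta_m)` (so `theta_i` affects only `F_i`, and `del_i F_i = F_i`):

```python
import numpy as np

# Check equation (2) by finite differences, assuming F_m = exp(theta_m)
# so that theta_i affects only F_i and del_i F_i = F_i.
theta = np.array([0.7, -0.2, 0.1, 0.4])

def logP(th, k):
    F = np.exp(th)
    return np.log(F[k] / F.sum())

i, k, eps = 2, 0, 1e-6
F = np.exp(theta)
analytic = (F[k] if i == k else 0.0) / F[k] - F[i] / F.sum()  # eq (2)
d = np.zeros_like(theta); d[i] = eps
numeric = (logP(theta + d, k) - logP(theta - d, k)) / (2 * eps)
print(analytic, numeric)  # agree to ~9 decimal places
```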

Separately, if we have some true data `bar D` (a large set of observed states of `D`, like throws of a die), expressed as a normalized distribution `bar "P"(bar D)`, then we can compute our model's probability of generating that observed data (per-sample, so independent of the size of `bar D`) as:

` "P"(bar D) = prod_k "P"(D_k)^(bar "P"(bar D_k))` (3)

The log of which is the negative of the cross entropy:

` log "P"(bar D) = sum_k bar "P"(bar D_k) log "P"(D_k)` (4)

Taking the partial derivative and substituting in (2) and (1) from above:

` {:(del_i log "P"(bar D),= sum_k bar "P"(bar D_k) del_i log "P"(D_k)), (,= sum_k bar "P"(bar D_k) [ (del_i F_k)/(F_k) - (del_i F_i)/(sum_m F_m) ]), (,= sum_k bar "P"(bar D_k) (del_i F_k)/(F_k) - sum_k bar "P"(bar D_k) (del_i F_i)/(sum_m F_m)), (,= bar "P"(bar D_i) (del_i F_i)/(F_i) - (del_i F_i)/(sum_m F_m)), (,= (del_i F_i)/(F_i) [ bar "P"(bar D_i) - "P"(D_i) ]):}` (5)

In sum:

` {:("P"(D_k),-= F_k/(sum_m F_m)), ("P"(bar D),= prod_k "P"(D_k)^(bar "P"(bar D_k))), (del_i log "P"(bar D),= (del_i F_i)/(F_i) [ bar "P"(bar D_i) - "P"(D_i) ]):}` (6)

So, what's it mean and why is it interesting?

Equation (1) is a fairly common scenario when you have some ability to model the relative likelihood or frequency of events. The denominator normalizes those relative votes into an actionable probability distribution.

Equation (3) in most cases is the definitive measure of the quality of a model, and is thus usually the quantity you want to maximize.

Equation (5) puts them together with the interesting result that the optimizing gradient is proportional to the residual error we get by subtracting our predicted distribution from the observed one. Note that because these are probabilities, the components of that residual are nicely bounded between -1 and 1, so it has the makings of a very well-behaved gradient. It is intuitively appealing because it is zero when our prediction matches observation. And it's just surprisingly clean and simple.
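
To see how well-behaved it is in practice, here's a toy gradient-ascent sketch (again with the `F_i = e^(theta_i)` parameterization, where the update is just the residual itself):

```python
import numpy as np

# Toy gradient ascent on (4): with F_i = exp(theta_i) the update is just
# the residual, and the model distribution converges to the observed one.
Pbar = np.array([0.5, 0.25, 0.15, 0.10])
theta = np.zeros(4)
for _ in range(2000):
    P = np.exp(theta) / np.exp(theta).sum()
    theta += 0.5 * (Pbar - P)      # the bounded residual gradient
print(np.round(np.exp(theta) / np.exp(theta).sum(), 4))  # ~= Pbar
```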

Where it goes from here depends on the nature of the underlying `F` functions.


One case to consider is when the `F` distribution is a product of distributions, each `G_j` being a non-normalized pseudo probability distribution over `i`:

` F_i = prod_j G_(j,i)` (7)

Then:

` del_(j,i) F_i = F_i/G_(j,i) del_(j,i) G_(j,i)` (8)

Applying (5), note that `F_i` cancels out, so curiously we end up with essentially (5) again:

` del_(j,i) log "P"(bar D) = (del_(j,i) G_(j,i))/(G_(j,i)) [bar "P"(bar D_i) - "P"(D_i) ]` (9)

The interesting point here being that even though many `G_j`s are being combined into the final predicted distribution, the "explaining away" competition amongst them is coordinated entirely via the consolidated `"P"(D_i)` predictions (as expressed in the error residual). That is, other than the common error residual, the gradient with respect to a parameter of some `G_(j,i)` is local.
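
A quick check of (9) for a product model, under the hypothetical parameterization `G_(j,i) = e^(phi_(j,i))` so that `(del_(j,i) G_(j,i))/(G_(j,i)) = 1`: every factor `j` comes out with the same shared residual as its gradient.

```python
import numpy as np

# Check equation (9) for a product model F_i = prod_j G_(j,i), assuming
# G_(j,i) = exp(phi[j,i]) so del G / G = 1: every factor j sees the same
# shared residual Pbar_i - P_i as its gradient.
rng = np.random.default_rng(1)
phi = rng.normal(size=(3, 5))      # 3 factors over 5 states
Pbar = rng.dirichlet(np.ones(5))

def logP_data(ph):
    F = np.exp(ph).prod(axis=0)    # eq (7)
    return np.sum(Pbar * np.log(F / F.sum()))

F = np.exp(phi).prod(axis=0)
P = F / F.sum()
analytic = np.broadcast_to(Pbar - P, phi.shape)  # eq (9), same for all j
eps, numeric = 1e-6, np.zeros_like(phi)
for j in range(3):
    for i in range(5):
        d = np.zeros_like(phi); d[j, i] = eps
        numeric[j, i] = (logP_data(phi + d) - logP_data(phi - d)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))        # ~1e-10
```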

Note this is analogous to the process that causes singular vectors to become orthogonal in the incremental SVD approach (when training them all at once as opposed to one by one). There we are removing the net projection of the current vectors from the input before training. Here we are removing the net prediction. This reverse inhibition is a recurring theme in various learning algorithms, including some brain-inspired ones.


[Source: personal notebook, Jan 2005]





Simon Funk / simonfunk@gmail.com