[<< | Prev | Index | Next | >>]

Tuesday, August 14, 2018

Probability Density Magic

[Research Log: Wherein lies even simpler universal math for self-supervised learning.]

A couple of days ago I took another look at Probability Density Mapping hoping to find some useful bounded approximation (or incremental equivalent or whatnot) to the ungainly determinant. Instead, starting from the log derivative of PDM equation (5) and combining the derivative of the determinant* and the inverse of the Jacobian* with some careful applications of the chain rule*, the determinant just disappeared entirely. Go figure. Here's the final, useful result:

` del_omega log p_omega(x) = del_omega log p_z(z) + (: del_omega dz/dx , {: dx/dz :}^T :)_F` (1)

Where `(: ,:)_F` is the Frobenius inner product*.

Recall, this tells us how to maximize the probability of our model generating a training set. I.e., it is an ideal, unsupervised learning algorithm.

The last term, which is the new bit here, is the partial derivative of the log Jacobian determinant. (As such, it is chainable with simple summation when `x->z` is a chain of functions as in a deep network. But beware when the Jacobian is location sensitive, as with most non-linear layers, that an `omega` from any previous layer will likely have non-zero impact via `(d^2z)/dx^2 dx/(d omega)`)

`dx//dz` is the (matrix) inverse of the forward Jacobian `dz//dx`, or equivalently the Jacobian of the inverse (generative) map from `z` to `x`.

At first this doesn't seem like a big win--we've traded in the determinant for an inverse. But the inverse function is almost certainly something we want anyway: it is the generative mode, which lets us render patterns `x` from our high-level feature vector `z`. Furthermore, we don't have to compute the inverse from scratch: since our forward function is incrementally updating as we learn, we can just train the inverse function at the same time. Unlike the forward mode, which is unsupervised, the generative mode can be trivially supervised (on a per-layer basis) since the forward mode determines all of our hidden states (up to and including `z`).

One term, `del_omega dz//dx`, is a bit ungainly since it is the size of the number of free parameters, `omega`, times the dimensionality of the pattern, `x`, times the dimensionality of the encoding, `z`. But in practice, this can be a very sparse tensor, especially when figured only for a single layer at a time. In plain terms, it is: For each pattern/encoding element pair, `x_i` and `z_j`, how much does changing `x_i` change `z_j`? And then in turn, how much does that change when we change each individual model parameter `omega_k`? Consider, then, that for a given `x_i` and `z_j` pair, any `omega_k` that is nowhere in the path between them will give a zero in the above term. For instance, in the case of a matrix multiply even between fully connected layers, there is only one relevant `omega` per `x_i`, `z_j` pair--the matrix value at position `i`, `j`. I.e., in that common case, the sparse tensor is only actually as large in practice as the underlying matrix.

Anyway, that's the basic formula, and what remains is to find different mapping functions to apply it to.

I'll quickly analyze the matrix multiply as an inspirational example:

` z = Mx` (2)

As mentioned above, the `del_omega` over the matrix multiply is essentially a selector function (with 1 in the matching position and 0 everywhere else) and so serves as an identity tensor in the context of the inner product, resulting in:

` del_M log |d/(dx) M x| = (: del_M d/(dx) M x , {: d/(dz) M^-1 z :}^T :)_F = M^-T` (3)

Where `M^-T = (M^-1)^T`. Note that `M^-1` is the matrix we would use to generate patterns (top down).

This solves the last term of equation (1) for the `z = Mx` case, but the other term depends on our priors for `z`, which we can choose as we like. For the simple case of unit-normally distributed `z`:

` del_M log p_z(z) = -Mx x^T` (4)

Giving (from equation (1)) the learning rule:

` del_M log p_M(x) = M^-T - Mx x^T = M^-T - zx^T` (5)

Which is zero (maximal probability) over the full training matrix `X` when:

` M^-1 M^-T prop X X^T` (6)

Which is satisfied when `M^-1` (our generative matrix) equals the eigenvectors of the covariance matrix of our input vectors (scaled by the square roots of the corresponding eigenvalues). In other words, equation (1) applied to a linear model with a normally distributed code vector trivially leads to PCA*.

That's a good sign, eh?

2018-08-30 Addendum:

For giggles and grins I solved the same example for the generative matrix instead of the perception matrix, `G=M^-1`:

` x = Gz` (7)

Which results in the learning rule:

` del_G log p_G(x) = G^-T (zz^T - I)` (8)

Which again maxes out when `z` is unit-normally distributed as we expect.

Question: can we use both of these learning rules simultaneously to update both `M` and `G` together so that we never have to compute an inverse? Together they would be:

` {: ( del_M = G^T - zx^T ), ( del_G = M^T (zz^T - I) ) :}` (9)

Answer: No! Jointly these quickly diverge as the error `GM!=I` compounds.

2018-08-30 Addendum B:

If we choose a Laplace* prior instead of normal, equations (5) and (8) become:

` del_M log p_M(x) = M^-T - "sign"(z)x^T` (10)

` del_G log p_G(x) = G^-T ("sign"(z)z^T - I)` (11)

[<< | Prev | Index | Next | >>]

Simon Funk / simonfunk@gmail.com