
## Saturday, October 14, 2017

#### Probability Density Mapping

[Research Log: Wherein lies some universal math for self-supervised learning.]

I like the spirit of Variational Bayes but the details seem a bit clunky, so I have been trying to distill the key conceptual elements of it away from the ugly particulars. In the process of that, I happened upon a rather elegant and intuitive bit of math that answers an old, somewhat related intuition of mine about what an ideal, universal self-supervised learning law should look like, which I will describe here.

Be warned, all that follows is currently untested and not yet thoroughly reviewed. Feedback welcome.

[UPDATE: A friend more in the loop informs me the core idea here was explored back in 2016 by Dinh et al in Density estimation using Real NVP. So if you want a more formal treatment see that, or other related links at the bottom of this post. Now back to my naive re-discovery and informal hand-waving...]

Say we are trying to find the probability distribution of images of hand-written digits (see here for why). Images, being real valued, occupy a continuous space, so actually we are trying to estimate or learn a probability density function* (PDF), specifically one which has high values for valid images of digits, and low values for anything else.

In a perfect world, in order to practically represent that complicated PDF, we would like to "factor" it into a bunch of independent features, such as what the digit is, how thick of a stroke it was drawn with, at what slant, in what style, at what location and scale, and so on. (And we would like those features to be automatically discovered, as in here.)

At the limit we'd like to get to more and more esoteric features until we have so many that we could effectively describe any image exactly, via the feature space, right down to the pixel. Probably only the strongest few of those features would matter in the sense of generating meaningful change to the image, and the last features would literally just be describing the last bits of noise that the early features couldn't describe. So, in a sense, we'd like to perform a sort of (non-linear) "rotation" from the space of pixel values to the (same-dimensional) space of feature values.

What would make this useful is if that feature space were relatively simple and densely filled--ideally with the features being either statistically independent or dependent only in some simple ways that are easy to model. Because then, for one, we could (easily) randomly pick a value in the feature space and generate an image from that, and it should be a reasonable image. And likely, if we moved smoothly through the feature space, we should see the image change smoothly in character accordingly. And for two, almost certainly if the first is true, then those features would also be useful descriptors of the image for purposes of classification, reasoning, and so on.

That is, we would like to map (bidirectionally) from the lumpy, twisty, sparse PDF describing the natural image space into a smooth, compactly filled PDF that's easy to describe and manipulate.

But as I've described it, particularly with the mapping being one-to-one where the feature-encoded version contains enough information to perfectly reconstruct the original image (and to conceptually infinite precision), what's the metric we're trying to optimize, exactly?

Although I landed there from the Variational Bayes direction, the answer is most closely related to Infomax*, particularly the deterministic variant described here. However, Infomax is a bit slippery to the intuition, and for a reason: It implicitly targets an infinite prior (unbounded maximal entropy), and so various work-arounds have to rein that in. (In the original version, via some qualified encoding noise. In the second case, they restrain the output domain through sigmoidal non-linearities, effectively creating a finite prior on the output space.)

There is a much simpler approach that is just one term away from Infomax, but is vastly more intuitive and manageable: Rather than maximizing mutual information between the input (image) and output (features), simply maximize the probability of the input.

But wait... That's nothing new. Even I've been proposing that for (25?) years... And we all know it's invariably messy approximations to an NP-hard problem, right? Well, no, when you set up the problem as I've described above, it turns out the solution is rather trivial:

If we choose a friendly PDF for our feature space, then we only have to map that distribution through our (bi-directional) transformation function to see what it looks like in the input space. And that's a standard bit of calculus probably worked out in the mid-1800s, solvable in polynomial time (roughly O(n^3)) under some reasonable assumptions. And once we have that, we're home free, because we can differentiate that with respect to our free parameters, and then maximize it over our empirical samples.

Although in practice it goes the other direction. Intuitively it works like this: For each input sample (image), we map it through our current transform into the feature space. The effect of the gradient (to boost the probability of that sample) is to expand the PDF in the feature space at that point while simultaneously contracting or bending the feature space to stay within its high probability regions. In other words, our input samples project through our transform to make a cloud in the feature space, and the PDF we chose for the feature space acts like a sort of cage which tries to constrain that cloud at the same time the cloud is trying to grow both as large and uniform as possible (each point is trying to expand the space around itself, i.e., push away the other points). So what starts out as a lumpy, twisty, sparse cloud, inflates like a balloon into the well-defined cage we've made for it, hopefully unfolding and filling out in the process so that by the time it's done it fits the space snugly. (The "cage" here is not defined by sharp boundaries, but rather softly by the isosurfaces of the PDF.)

The two conceptual terms--the inward pressure from the cage, and the outward pressure to expand and fill the space--fall out naturally from the math as two distinct factors. In the log probability gradient, these two terms would be simply summed, and so compete against each other linearly.

Because of this separability, the feature space PDF (the "cage") can look however we want with minimal impact on the algorithm, and if we like we can parameterize that too and allow it to be adaptive. (E.g., should the features have normal distributions or have longer tails, or even be multi-modal? If we make things like that tunable, then the cage ceases to be rigid and instead can flex to fit a variety of common but simple shapes.)

Without further ado, here's the math, with more discussion to follow:

Consider samples drawn from some probability density function over a pattern space, such as the space of images:

  x ~ p_t(x) (1)

Here x is a single image, and p_t describes, for instance, the true distribution of natural images.

Now consider a feed-forward function, F, such as a typical deep network, which maps from x (an image) to some encoding, z:

  z = F_omega(x) (2)

Here omega are the tunable parameters of the network (the "weights") and will be omitted as implicit henceforth for clarity.

Assume F is smooth and invertible, and so for now that x and z are of the same dimensionality or degrees of freedom.

If we draw z from some prior:

  z ~ p_z(z) (3)

And generate x from that:

  x = F^-1(z) (4)

Then*:

  p_omega(x) = p_z(z) |dz/dx| (5)

Where |dz/dx| is the absolute value of the determinant of the Jacobian* of F at x.

We can choose p_z(z) to be a simple PDF that is trivial to sample from, such as a spherical unit normal distribution:

  p_z ≡ N(0, I) (6)

At which point equation (5) is now a deterministic, differentiable (with respect to omega), exact formula for p_omega(x) computable in polynomial time. (Note that in practice we are not sampling z and computing x but, rather, empirically sampling x and computing z from it. Hence F is only used in forward mode and we do not need its inverse at all, except if we wish to use it as a pattern generator.)
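As a concrete sketch of that claim, here is equation (5) evaluated in numpy for a hypothetical affine flow z = F(x) = A x + b with the unit-normal prior of equation (6). The names A, b, and log_p_omega are mine, for illustration only:

```python
import numpy as np

def log_p_z(z):
    # log of a spherical unit normal, equation (6)
    d = z.shape[-1]
    return -0.5 * (z @ z) - 0.5 * d * np.log(2 * np.pi)

def log_p_omega(x, A, b):
    # log p_omega(x) = log p_z(F(x)) + log |det dF/dx|, equation (5);
    # the Jacobian of an affine map is just A, everywhere
    z = A @ x + b
    sign, logdet = np.linalg.slogdet(A)
    return log_p_z(z) + logdet

A = np.array([[2.0, 0.3], [0.1, 0.5]])
b = np.array([0.2, -0.1])
x = np.array([0.4, 1.0])
print(log_p_omega(x, A, b))
```

Note that only the forward direction of F is ever evaluated, exactly as described above.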

We are aiming to make p_omega(x) approximate the true distribution p_t(x), and so we would typically follow a log-probability gradient of equation (5) with something like:

  Delta omega ∝ sum_{x ~ p_t(x)} ( ∂/∂omega log p_z(F_omega(x)) + ∂/∂omega log |dF_omega(x)/dx| ) (7)

Equation (5) is the centerpiece here, so let's look at that starting with the last term:

The Jacobian itself, dz/dx, maps change along each axis of x (each pixel, in the case of a grey scale image) to change in some direction (a vector) in the feature space of z. Stacked together into a matrix, it describes what a tiny axis-aligned cube at x in the image space looks like after it's been mapped into a parallelepiped* at z in the feature space. The determinant of that matrix, in turn, measures the volume of that parallelepiped (implicitly: relative to the original unit cube). Equation (2) effectively states that the probability of a sample falling in those two volumes is the same (because they are everywhere analogous, within a local linear approximation), which means the probability density must differ by the ratio of their volumes, which leads directly to equation (5). (More Jacobian basics here.) So, in sum, making this term bigger equates to assigning more probability to (that particular) z. Worth noting is that it may also arbitrarily move z around in the process. That term alone places no constraints on what z location a given x maps to--it only cares how much the immediately surrounding space is expanded or compressed by the mapping. So this term is concerned only with the density or spreading of the output point cloud.
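That volume interpretation is easy to check numerically: push a tiny square through some nonlinear map and compare its image's area to the absolute Jacobian determinant at that point. The map f below is invented purely for the demo:

```python
import numpy as np

def f(p):
    # an arbitrary smooth 2d -> 2d map, made up for illustration
    x, y = p
    return np.array([x + 0.3 * np.sin(y), y + 0.2 * x * x])

def jacobian(f, p, eps=1e-6):
    # forward-difference Jacobian; column j holds df/dp_j
    n = len(p)
    J = np.zeros((n, n))
    fp = f(p)
    for j in range(n):
        dp = np.zeros(n); dp[j] = eps
        J[:, j] = (f(p + dp) - fp) / eps
    return J

p = np.array([0.5, 1.0])
det = abs(np.linalg.det(jacobian(f, p)))

# area of the image of a tiny square with side h, via the shoelace formula
h = 1e-4
corners = [p, p + [h, 0], p + [h, h], p + [0, h]]
img = [f(c) for c in corners]
area = 0.5 * abs(sum(img[i][0] * img[(i + 1) % 4][1] - img[(i + 1) % 4][0] * img[i][1]
                     for i in range(4)))
print(area / h**2, det)  # these should nearly agree
```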

Next then, how should we choose the feature space PDF p_z(z)? Complementary to the Jacobian term, this one cares only about z's absolute location, and so determines the shape of our output point cloud. What we want here is a PDF that is easy to compute, easy to sample from, and which F can hope to reshape our input point cloud to fit.

A normal distribution is a natural choice, but has a few shortcomings. For one, it is rotationally symmetric (its isosurfaces are spheres) which means there will be no preferred alignment of true features with the axes of z. I.e., it may find a bunch of useful features, but they'll all be randomly scrambled together in z. Also, more abstract features tend to follow less normal distributions (normal distributions are what you get when you blend a lot of things together; but we are trying to pull things apart back into their causes), so if our distributions can be more adaptive in shape, we relieve a lot of pressure from F. Then F can be more concerned with factoring and orienting, and p_z with shaping the distribution in the resulting space. In a sense we can attack the problem from both ends, in different ways.

Since we would generally like our features to be independent, it makes sense to treat them as such in defining p_z. And once we do that with any non-normal distribution, the space should prefer features to be axis aligned.

Consider independent Laplacians*. (Probably a good choice in many cases for a simple, non-adaptive p_z.) In higher dimensions this corresponds to a sparse code* with most of the components of z in any given sample being near zero, and a small number being large. Rather than being a sphere, this distribution looks more like a spiked mace, with spikes down the axes of z which will tend to capture and hold useful feature directions (it wants to avoid the low-probability regions in between the spikes, which is where you end up if you spread a single long-tailed feature amongst multiple components). Note that scaling internal to F can optimize the effective sparseness on a per-axis basis, so we needn't decide a priori how sparse z should be nor which features should be more sparse than others. (The ones that wish to be more sparse can simply scale down so that it takes a larger deviation of that feature to cause the same effect. [Footnote 1])
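A minimal sketch of that prior: the log-density of independent unit Laplacians is just a sum of -|z_i| terms, so (anticipating Footnote 1) its gradient pulls every component toward zero with the same constant force:

```python
import numpy as np

def log_p_laplace(z, b=1.0):
    # product of independent Laplace(0, b) densities, in log space
    return np.sum(-np.abs(z) / b - np.log(2 * b))

def grad_log_p_laplace(z, b=1.0):
    # constant-magnitude inward force, independent of how large z_i is
    return -np.sign(z) / b

z = np.array([3.0, -0.2, 0.0, 7.5])
print(log_p_laplace(z))
print(grad_log_p_laplace(z))  # every nonzero entry has magnitude 1
```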

In general, whatever distribution we pick, we want it to be fairly homogeneous so that any local minima are as good as any other. I.e., if our "cage" has too particular of a shape, the odds are high that our point cloud will get wedged in at some wrong orientation and be stuck there. This is alleviated if there is no wrong overall orientation. (Consider the spiked mace -- all we care about is each feature finds a spike to live in, but we don't care which one.) [Footnote 2]

The next question is what should F look like? The most challenging aspect here is the requirement that it be invertible. (Note, btw, that it is the invertibility that ultimately keeps this algorithm polynomial-time efficient: by keeping things deterministic in all directions, we never need to search for an explanation to something.) But if we would like to use it as a generator (z to x direction) as well as recognizer, then we need designed-in invertibility anyway. Interestingly, though, there appears to be no requirement that it be analytically invertible--it would suffice, for instance, to have some function G ≈ F^-1 which learns the inverse mapping entirely separately. While the primary gradient doesn't rely on such a G being available, there are a couple of ways it could be useful in the training, one of which I'll mention now and the other later:

To our advantage, the Jacobian term will steer F away from locally non-invertible states with infinite force (non-invertible equates to zero probability which equates to negative infinity in the log space where our gradient lives), but the Jacobian cannot detect if the feature cloud bends around smoothly and crosses through itself. However, G would implicitly detect these regions since they would be unresolvable, and there are various ways that could be incorporated either into the gradient or heuristics to avoid that case. I.e., by coupling the two, G could empirically bind F to invertibility even if we can't directly invert F. (For instance, we could simply pair the two as an auto-encoder and then add the resulting error with fairly tight tolerances as a penalty term to our gradient. As long as G is able to track F well, it will have little impact on F, but where it starts to fail due to non-invertibility, that error will propagate back through G into F, pushing it back into the region where G can properly disentangle the mapping.)

On the tack of making F explicitly invertible, functions that are non-invertible are throwing away some information, so if we simply recover and deliver that information in another channel, we re-establish invertibility. For instance, the average of two numbers is naturally complemented by their difference. The magnitude of a complex number is complemented by its phase, and so on. [Footnote 3] Also, any function we build from invertible components (provided we don't drop any of their outputs along the way) will be invertible as well.

Matrix multiply is invertible as long as the matrix isn't singular (which the Jacobian determinant will steer away from). Typical wavelet transforms are in effect just big, invertible, matrix multiplies. This means equation (5) could potentially discover wavelets under the simple case of:

  F(x) ≡ M x (8)

This linear case has been analyzed extensively in the context of Infomax with similar equations, resulting in one of the best methods of Independent Component Analysis*. Note here the Jacobian term is independent of x entirely (dz/dx = M) which means M is simply encouraged to be as orthogonal as possible, and scaled as large as possible, within the constraints of p_z. (Note also that we can analyze convolutional nets as just a case where much of M is clamped to zero, so that each feature has limited scope on x. Similarly if we specifically wanted to hunt for natural wavelets, we could configure a sparse M similar to a collection of convolutional ones at different scales. Either of these cases also greatly speeds the computation of the determinant, at least in the purely linear case--which it remains even if we have a non-linear p_z since that term falls outside the determinant.)
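For the curious, here is a toy numpy run of that linear case: gradient ascent on L(M) = E[log p_z(M x)] + log |det M| with a Laplacian p_z, which is the classic Infomax/ICA update (the gradient of the first term is -sign(z) x^T, and of the second is M^-T). The sources, mixing matrix, and learning rate are all invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 2000))          # independent Laplacian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # unknown mixing matrix
X = A @ S                                # observed data, x = A s

def avg_log_lik(M, X):
    Z = M @ X
    return (-np.abs(Z).sum(axis=0).mean()     # log p_z term (up to a constant)
            + np.linalg.slogdet(M)[1])        # log |det M| term

M = np.eye(2)
lik0 = avg_log_lik(M, X)
lr = 0.05
for _ in range(200):
    Z = M @ X
    grad = -np.sign(Z) @ X.T / X.shape[1] + np.linalg.inv(M).T
    M += lr * grad
print(lik0, avg_log_lik(M, X))  # likelihood should have increased
```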

Moving on, an obvious question is whether we can get away with some approximation to invertibility, particularly to relax the constraint that our input (image) and output (feature) space be of the same size. Here are some initial thoughts:

For starters, if the spaces are of different size, the Jacobian is no longer square, and then the determinant is not defined. However, we can square (multiply by its own transpose) the Jacobian, whichever way results in the smaller matrix, and use the square-root of the determinant of that. This gives us something analogous (and equivalent in the same-size case), but what exactly depends on some particulars.
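That squaring workaround is a one-liner in practice; the helper below (log_pseudo_det, my name) squares the Jacobian on its smaller side and reduces to the ordinary log |det J| in the square case:

```python
import numpy as np

def log_pseudo_det(J):
    # square the Jacobian on whichever side gives the smaller matrix,
    # then take half the log-determinant of that
    m, n = J.shape
    JJ = J @ J.T if m < n else J.T @ J
    return 0.5 * np.linalg.slogdet(JJ)[1]

J_wide = np.array([[1.0, 0.2, 0.5], [0.0, 1.5, 0.3]])   # 3d input -> 2d output
J_square = np.array([[2.0, 0.1], [0.3, 0.8]])
print(log_pseudo_det(J_wide))
print(log_pseudo_det(J_square), np.log(abs(np.linalg.det(J_square))))  # equal
```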

When the output space is smaller (dimensions) than the input space, equation (5) is working on a naive projection of the would-be higher dimensional output cloud into the lower dimensionality space by simply dropping the extra dimensions. So instead of telling us the probability of x, it's telling us the total probability of the whole subspace of x's that map to that particular z (with the distribution within that subspace left completely undefined). In principle this could still work fine in many cases, since it is still trying to maximize the spread of the point cloud in the output space, which is best achieved by discarding the least filled (i.e., most irrelevant) output dimensions. [Update: Maybe not--need to think about this more, and may have it backwards. See below.]

For example, imagine our input point cloud falls roughly in a 2d plane embedded in a 3d space, and we set up F to map from 3-space to 2-space. (I.e. a typical rendering of a 3d point cloud onto a 2d display. F is the rendering aka projection transform here.) Since equation (5) is trying to uniformly fill the space with those points, it will naturally prefer to orient the 2d plane face-on, since to orient it edge-on would concentrate all the points into a dense line (which is the antithesis of its objective). [Update: That in particular is suspect. It may in fact prefer to rotate the expansive direction in x into the missing z dimensions where there is no p_z penalty, and then maximize the Jacobian in the least relevant x direction where it can do so cheaply. If so, something more clever is needed to enable dimensionality reduction with this approach.] And note that while there may be no obvious way to invert this function (particularly if there was some 3d variation from a perfect plane in our data), once it's face-on aligned like that it becomes much easier to learn an inverse function G which projects our lower dimensional feature points back into the higher dimensional pattern (input/image) space, using whatever best-fit metric is appropriate.

In other words, the invertibility constraint may be rather soft. [Update: or... not.]

What about when the output space is larger than the input space? Now the forward function can easily be invertible, but there's a new problem: The output point cloud now lies on (and the determinant measures expansion on) a lower dimensional subspace. So imagine we have a spherical target distribution we are trying to fill, but our projection comes in the form of a 2d surface like a piece of paper. We can expand and expand the paper forever and still fit it in the sphere simply by crumpling it up more and more--which is the opposite of what we want.

However, it is possible to rescue this case by imposing some geometric constraints on how the point cloud is allowed to move in the output space. Effectively, anything that keeps it from crumpling up solves that particular problem. I'll illustrate with an example that also makes an interesting general strategy for deep nets:

Consider a 2d input space (each example is just two scalar values) which when viewed as a point cloud looks like a worm randomly curled and squiggled on our plot. Assume that we know this, but we just don't know what particular shape it might take. A natural feature representation here is just a single scalar uniformly spanning 0 to 1 which measures the relative distance along the worm's body from head to tail. That is, the generative model is: pick a random number from 0 to 1, push it through the inverse of F, and out pops the x,y location of the worm's body at that fraction along. But that describes an infinitely thin worm, just a pencil line on our graph, whereas our point cloud worm is a bit plump.

Some observations at this point: The forward mapping F, in this case, would actually be a bit heuristic, something like searching along the length of the current worm model to find the closest point to our sample, and then using that as our z value. (p_z in this case is constant and plays no role; the model itself constrains z to a finite domain, and so the last term simply tries to make z as uniform as possible within that range.) Note that this is still invertible up to the best approximation on the thin line (the generative model described above) and differentiable (the gradient is aligned with the worm's body at that point), so it may simply work as-is (per the dimensionality-reducing reasoning above) although in practice it's a rather obscured gradient (it only "sees" the projection of one worm onto the other, and has to try to guess what shape the original is based on how clumpy that projection is--and can only reshape by expanding and contracting the relative length at various points), so my guess is it would have a hard time.

We could add another feature which measures orthogonally away from the worm's body, and now we have a 2d-2d, properly invertible mapping. This should learn our plump worm point cloud just fine. (Unsure about local minima in practice.) This would probably be the "right" way to handle this example, but let's consider something different:

What if we mapped our input point to our single scalar measure along the body, then mapped this back into the input space (effectively finding the closest point along the center-line of our current model worm's body), and then produced the 2d error as an additional pair of features? Now we have three features in total -- the distance along the body, and a 2d displacement from that. I.e., we are projecting our 2d input space into a 3d feature space.

What this looks like in the feature space is a twisty 2d ribbon hanging down. The vertical offset corresponds to our distance along the worm's body, and then the ribbon's twist or orientation at that height has it facing the direction of the worm's body in the input space at the corresponding location (because the "error" vector can only be orthogonal to that). For p_z we might choose a cylinder, ranging uniformly from 0 to 1 bottom to top, and having a 2d normal distribution for its heft.

When we start out with our model worm in some random orientation, our input point cloud (our empirical worm) will appear in our feature space as a 2d point cloud painted on our twisty ribbon--squiggling all over it, jutting well outside of our cylinder and such. But because of the geometric constraints on that ribbon, the point cloud (which lives on the ribbon) cannot crumple up--all it can do is move around on the ribbon. (In turn the ribbon will swivel around, but the gradient here is insensitive to that in any direct sense.) So, as per the generic case, the points that stick way outside our cylinder (by which I mean: into the tails of our 2d normal distribution) will be pulled back inside--which translates directly to moving our model worm closer to the empirical one. The training is done when our point cloud looks like a plump, straight stripe down the center of our ribbon, about the width of our cylinder. (Again the twisting of the ribbon here is indicative but inconsequential.) This corresponds to our model worm being perfectly aligned with the true one. (The one catch is our model worm will have a round head even if the original was square...)

So, in sum, we can get away with having a larger output space than input space under certain constraints. (I suspect it might be formalizable as matching degrees of freedom, but not sure offhand of what exactly...)

Furthermore, this example suggests a couple of ideas for self-supervised deep learning:

If we think about what's inside the worm model, there's a sort of feature expansion and contraction going on. The worm there is (likely) modeled via enumeration, where each "segment" is analogous to a different feature (like a differently oriented Gabor filter for vision), and then this expanded encoding is compressed down again to a compact one (the distance along the body). In the more general, higher-dimensional case, we might imagine an image patch being expanded into a very large number of templates (i.e., a very high-dimensional but very sparse intermediate feature space), and then compressed back down into a smaller number of features on the order of the number of pixels in the patch. Equation (5) here would only look at the pixels in and the features out, and would indirectly tug around the (much larger) middle layer through those.

Second, we can imagine using a sort of auto-encoding setup with F and F^-1 (or a learned G), for each layer of a deep network, shedding "error vector" z components all along the way. I.e., the generative model here would be: pick some "noise", add it to the current layer's activations, deterministically generate the next layer down's activations, repeat. This starts to sound like Auto-encoding Variational Bayes* in a ladder network?, but with an interesting twist. We might choose normal distributions for the "error" terms, and adaptive or Laplacian distributions for the "feature" terms (such as at the top layer, or if we wish to shed some feature components layer by layer as well). That is, we have qualitatively different types of outputs here, which we can treat qualitatively differently. Note also that these "error" terms are exactly the ones mentioned above in the strategy to enforce invertibility via the approximate inverse G, except that instead of being (arbitrary) "tight tolerances" they are properly accounted noise terms. I.e., this configuration inherently monitors F for, and protects against, self-collisions, which frees us up to keep F quite flexible.

Hopefully a combination of the principles discussed earlier would handle both some up- and down-sizing of the dimensionality, as we might imagine in the end having a small number of (heavy-tailed) features, and a large number of (normal) "noise" terms, with the former being smaller and the sum of both being larger than the input dimensionality.

I've gone on rather at length already, but there's one more topic to cover:

The determinant is rather expensive to compute. Naively done, it is O(N!) but there are O(N^3) methods. Still, that's pretty bad when N is the number of pixels in a large image, say. However, there are lots of things that could be done to mitigate this. For one, if we consider the model above where we are shedding z components layer by layer, assuming local scope of the lower layers this would produce a relatively sparse Jacobian (of fixed sparse structure) which could be optimized for. Second, consider a case like equation (8). There we could simply choose to represent M by its LDU decomposition*, for which the determinant is trivial to compute (it's just the product of the elements of D). (We needn't worry about pivoting because we don't care about the order of the output features.) In the general case, we can look for a way to formulate F such that its Jacobian determinant is easy to compute. Third, we could look for cheaper approximations that might apply to particular architectures. Fourth, we might be able to find a gradient that implicitly rather than directly expands the Jacobian, such as how Oja's Rule* enforces both orthogonality and normalization through reverse inhibition (the orthogonality aspect being most relevant here).
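The LDU point is easy to demonstrate: if M is kept in triangular-factored form, the log-determinant collapses to a sum of log-diagonals, O(N) instead of O(N^3). The factor construction below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
L = np.tril(rng.normal(size=(N, N)), -1) + np.eye(N)   # unit lower triangular
U = np.triu(rng.normal(size=(N, N)), 1) + np.eye(N)    # unit upper triangular
d = rng.uniform(0.5, 2.0, size=N)                      # diagonal of D

M = L @ np.diag(d) @ U                                 # M = L D U

cheap = np.sum(np.log(d))                              # O(N), given the factors
full = np.linalg.slogdet(M)[1]                         # O(N^3), from scratch
print(cheap, full)  # identical up to round-off
```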

Lastly offhand, note that a deep network without layer skipping connections can be viewed as a chain of functions (one for each layer), which by the chain rule* for the Jacobian determinant means we can compute it separately for each layer and just add them up (in the log gradient space). These gradients are nearly independent of each other (as if each layer were strictly self-supervising) except that each layer receives the output of the lower layer as its empirical x input, which means that higher layers do impact (supervise!) lower layers through this path.

Ok, one more thought: Applying the above ideas, it occurs to me that it's dirt cheap to compute the Jacobian determinant for a typical neural network (at least one with smooth, independent non-linearities like element-wise sigmoid) using the chain rule to connect the matrix with the non-linearities: Represent the weight matrix as LDU and then the log Jacobian determinant is just the sum of the log diagonals of D plus the sum of the log slopes of the non-linearities (evaluated at x). Presto -- deep network, self-supervised, computationally about the same cost as supervised. (But it's four in the morning, so don't quote me on that.)
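Here is a small numeric check of that four-in-the-morning claim, under the stated assumptions (element-wise sigmoid, weights held in LDU form). The cheap chain-rule sum is compared against a brute-force finite-difference Jacobian; all the particular values are invented for the demo:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
N = 4
L = np.tril(rng.normal(size=(N, N)), -1) + np.eye(N)   # unit lower triangular
U = np.triu(rng.normal(size=(N, N)), 1) + np.eye(N)    # unit upper triangular
d = rng.uniform(0.5, 2.0, size=N)                      # diagonal of D
W = L @ np.diag(d) @ U                                 # weights in LDU form
b = rng.normal(size=N)
x = rng.normal(size=N)

# z = sigmoid(W x + b), so dz/dx = diag(sigma'(pre)) W and
# log|det dz/dx| = sum(log d) + sum(log sigma'(pre))
pre = W @ x + b
slope = sigmoid(pre) * (1.0 - sigmoid(pre))
cheap = np.sum(np.log(d)) + np.sum(np.log(slope))

# compare against a central-difference Jacobian of the whole layer
eps = 1e-5
J = np.zeros((N, N))
for j in range(N):
    dx = np.zeros(N); dx[j] = eps
    J[:, j] = (sigmoid(W @ (x + dx) + b) - sigmoid(W @ (x - dx) + b)) / (2 * eps)
print(cheap, np.linalg.slogdet(J)[1])  # should agree closely
```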

Anyway, enough for now. Seems like there's a lot to explore here.

If you find a use for or expand upon these ideas, please let me know. I will maintain an index of links to any pages or papers at the bottom of this page.

Footnote 1:

Interesting to note that the gradient of the log Laplacian PDF is just minus the sign of each element of z. I.e., the first term of equation (7) in that case is simply pushing all of the elements of z toward zero all the time, with the same constant force. So one wonders, how does that ever find an equilibrium instead of either sticking to zero or growing forever? (Compare to the normal distribution where the restorative force is proportional to the z values and hence will tend to grow under an outward force until that increasing inward force just cancels it out.) The answer is that the expansive force from the Jacobian term shrinks by the inverse of the scale of z (when the scale is already very large, adding a constant amount does very little), so it can only get so large before it's too weak to push against the constant restorative force.
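This balance is easy to see in one dimension: treat the transform as a pure scaling z = s*x for a fixed x, so the objective is log p_z(s x) + log s. The constant pull -|x| then cancels the 1/s push exactly at s = 1/|x| (the grid search below is just for illustration):

```python
import numpy as np

x = 2.5  # a fixed, arbitrary sample

def objective(s):
    # log Laplace density (up to a constant) plus the log Jacobian of z = s*x
    return -abs(s * x) + np.log(s)

s_grid = np.linspace(0.01, 2.0, 100000)
s_best = s_grid[np.argmax(objective(s_grid))]
print(s_best, 1 / abs(x))  # numerical optimum matches the analytic balance
```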

Footnote 2:

More complex distributions are possible, including with dependencies between the elements of z. Temporal sequences, in particular, seem worth exploring. Consider, for example, if we know that x is composed of a temporal sequence of frames (e.g., if x were a short video). Then we could chop up z similarly, and add a (learned) dependency from past to future frames of z. That is, as a generative model, we would pick the features for our first frame, then use a deep net to compute the parameters for the portion of p_z pertaining to our second frame, sample that part of z (the features for our second frame) accordingly, and so on until we have a complete z. Because z is computed deterministically as F(x), then p_z is fully specified during training from examples (for training F), and likewise all of z is available to supervised-train the lateral deep net which unrolls z. What makes the example interesting vs. typical unrolling like this is that significantly more of the space could be devoted to the first frame of features (which need to describe the entire base scene) than to the latter frames (which may only need to describe how elements in that scene move, for instance)--because we are still training the entire sequence as a single ensemble.

I'm unsure offhand if lateral dependencies in the feature space as described confer any greater abilities than could just be learned in a comparably equipped F, however, so this may be a needless overcomplication.

Footnote 3:

If we normalize a vector, the normalizing constant becomes the missing information. This case is more complicated because we're lifting an N-sized vector into an N+1 sized space. For instance, in the case of a 2d vector, we are transforming from the 2d plane into a cylinder, where the angle of the vector is preserved while the length now becomes the height on the cylinder. Note the cylinder is still a 2d surface as our original plane was, but it lives in a 3d space. For the whole system to work as prescribed so far, we eventually need to cast this back into a 2d space in order to compute the Jacobian, which (once we've mucked about some more in the 3d space) can be tricky to do without losing information... Conversely, after conveying the magnitude and then normalizing, we could just throw away one of the two components. This would be like conveying a complex number by its real component and its length (technically we'd need signed length to make it fully recoverable). Extending this to a higher-dimensional case, we might normalize the vector and provide that magnitude along with the projections along N-1 learnable directions of interest, for instance.