Wednesday, September 13, 2023
Orthonormalizing PDM
[Research Log: Probability Density Mapping through orthonormal weight matrices]
Per some recent conversations with Colin Green* about PDM, I worked out some math for maintaining orthonormal weight matrices, which are their own inverse (via transpose) and which have no impact on the probability density, so they contribute no local gradient at all. (But their rotational transform does affect the gradient indirectly, via its impact on subsequent layers--more on that below.)
I'm just going to drop my scribblings in here so I don't lose them, but for reference the notion is that we have some orthonormal matrix M to start with, and we'd like to update it with some gradient delta minus some error-correcting matrix (e) such that the final result is still orthonormal.
Since the delta and error are assumed very small, we drop squared terms, assuming local linearity. The resulting modified gradient is circled. It has the effect of turning any small matrix delta into a nearest(?) strictly rotational change.
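For a quick numpy sanity check -- assuming the circled result amounts to `delta' = delta - M delta^T M`, which is the matrix-level form that matches the vector version in the next section:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Start from an exactly orthonormal M (via QR) and an arbitrary small delta.
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
delta = 1e-3 * rng.standard_normal((n, n))

# Assumed matrix-level form of the circled result: delta' = delta - M delta^T M.
delta_mod = delta - M @ delta.T @ M

# delta' M^T is antisymmetric, i.e. the step is purely rotational to first order...
print(np.allclose(delta_mod @ M.T, -(delta_mod @ M.T).T))        # True

# ...so M + delta' remains orthonormal up to O(delta^2).
M_new = M + delta_mod
print(np.abs(M_new @ M_new.T - np.eye(n)).max())                 # tiny: O(delta^2)
```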
Then, under the assumption that in practice the delta matrix is going to be the outer product of the input (x) with some back-propagated gradient (z'), we can derive the incremental version of the above, which avoids any matrix-matrix products and operates only at the vector level. (Note the nearly-obscured tick in the final term, which is `z x'^T` where `x' = M^T z'`.) This version gives exactly the same result as the matrix version above, without the matrix products.
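In numpy the incremental version is just two matrix-vector products and two outer products -- a minimal sketch, writing `z = M x` for the layer output and `z'` for the back-propagated gradient (the function name and interface are mine):

```python
import numpy as np

def ortho_delta(M, x, zp):
    """Incremental modified gradient: z' x^T - z x'^T, with z = M x and x' = M^T z'.

    Only matrix-vector and outer products -- no matrix-matrix products.
    """
    z  = M @ x       # layer output
    xp = M.T @ zp    # x' = M^T z'
    return np.outer(zp, x) - np.outer(z, xp)

# Matches the matrix form for delta = z' x^T:
rng = np.random.default_rng(1)
n = 8
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
x, zp = rng.standard_normal(n), rng.standard_normal(n)
delta = np.outer(zp, x)
print(np.allclose(ortho_delta(M, x, zp), delta - M @ delta.T @ M))   # True
```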
Finally, the inherent non-linearity of rotations does eventually accumulate errors, which cause M to drift away from orthonormality; that error compounds as M's transpose departs from its inverse, and eventually the whole thing fails. However, we can subtract a third term, related simply to the difference between `M M^T` and the identity matrix it should be, which, provided the error is small enough to begin with, will tend to reset M to exactly orthonormal with minimal impact on its absolute rotation. I have tried both versions in the third section, and they work, but the matrix form (which is expensive anyway) is especially unstable and tends to explode if the error ever gets too large. The vector form is more stable in practice, though I'm not sure why, and I don't know whether it can likewise go haywire or whether it's inherently more robust. But beware: this third term, which corrects any non-orthonormality of the matrix, can be applied either after the initial gradient is applied to the matrix, or concurrently with it. The former is probably more stable across a wider range of learning rates. The latter, however, cancels out x entirely and reduces the entire training gradient to just:
`(z' z^T - z z'^T) M` (1), which surprisingly does work. Note further that if M is orthonormal, then `z^T M = x^T` and this version reduces to exactly the modified delta from the second section, as we should expect: `z' x^T - z x'^T`. In short, passing x through M and back captures any non-orthonormality in M in a way that counteracts it in the gradient. Nifty, eh? (Oh, and by the way, this form is also easily applicable to convolutional layers.)
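For reference, here's a numpy sketch of both pieces: `reorthonormalize` is one concrete reading of the after-the-fact third term (I take it proportional to `(M M^T - I) M`; the coefficient here is my choice), and `pdm_grad` is the concurrent form (1):

```python
import numpy as np

def reorthonormalize(M, rate=0.5):
    """One step of the third term, applied after the fact:
    subtract rate * (M M^T - I) M.  For small drift this shrinks
    ||M M^T - I|| rapidly while barely changing M's overall rotation.
    (The coefficient is my choice here.)"""
    err = M @ M.T - np.eye(M.shape[0])
    return M - rate * err @ M

def pdm_grad(M, z, zp):
    """Concurrent form (1): (z' z^T - z z'^T) M.  Needs only the layer
    output z = M x and the back-propagated gradient z' -- x drops out."""
    return (np.outer(zp, z) - np.outer(z, zp)) @ M

rng = np.random.default_rng(2)
n = 8
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
x, zp = rng.standard_normal(n), rng.standard_normal(n)
z, xp = M @ x, M.T @ zp

# With M exactly orthonormal, (1) equals the modified delta z' x^T - z x'^T:
print(np.allclose(pdm_grad(M, z, zp), np.outer(zp, x) - np.outer(z, xp)))   # True

# And the correction step pulls a slightly drifted M back to orthonormal:
M_drift = M + 1e-2 * rng.standard_normal((n, n))
for _ in range(5):
    M_drift = reorthonormalize(M_drift)
print(np.abs(M_drift @ M_drift.T - np.eye(n)).max())   # near machine precision
```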
The whole shebang is related in spirit to the Generalized Hebbian Algorithm but is further generalized to an arbitrary target gradient (z') and is unordered. (An unordered GHA would only attain orthonormality with no particular relation to the input! We rely here on z' having useful correlation to the input.)
The reason all this is useful is that, the way PDM works in practice, each layer in the network back-propagates the impact that changing its input (dx) has on the probability density expansion of the remaining layers of the network--i.e., it encodes the location sensitivity of the network at that layer. This gradient is back-propagated in the usual way, plus the addition of any location sensitivity of the layer itself. The gradient applied to a given weight in a given layer is thus a combination of the immediate probability density impact of that weight plus the indirect impact via the back-propagated gradient. Orthonormal weight matrices are already maximally expanded, so they have no local gradient, and they are linear, so they have no location sensitivity of their own; but they do rotate the location, and so they respond to the back-propagated gradient, as well as rotating that gradient on the way down. (Cheaply and easily, since their inverse is just their transpose.)
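Put as a (hypothetical) layer sketch -- the interface, names, and sign/learning-rate conventions here are guesses, not settled code:

```python
import numpy as np

class OrthonormalLayer:
    """Sketch of an orthonormal layer under PDM-style backprop.

    Forward just rotates the location (z = M x).  Backward contributes no
    local term of its own: it rotates the incoming location-sensitivity
    gradient back down via M^T and nudges M with form (1)."""

    def __init__(self, n, seed=0):
        rng = np.random.default_rng(seed)
        self.M, _ = np.linalg.qr(rng.standard_normal((n, n)))

    def forward(self, x):
        self.z = self.M @ x          # rotate the location
        return self.z

    def backward(self, dz, lr=1e-3):
        dx = self.M.T @ dz           # rotate the gradient on the way down (inverse = transpose)
        # Weight update via form (1); no local density term, and M stays orthonormal.
        self.M += lr * (np.outer(dz, self.z) - np.outer(self.z, dz)) @ self.M
        return dx
```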
Tangentially of note: For randomly initialized layers, the back-propagated term (which, by the way, subsumes the first term in the PDM equation generated by the final output layer) will tend to wash out over many layers. But the local addition to the back-propagated term by each layer provides proximal contextual feedback to the layers below it, which allows an arbitrarily deep network to learn bottom-up. I.e., PDM exhibits localized Fusion-Reflection, which ought to make it a very fast learner.