[<< | Prev | Index | Next | >>]
Friday, June 26, 2015
I've been procrastinating implementing better sampling in my recent Fusion-Reflection (here Bayesian DAG) network out of laziness. This entry is an attempt at productive procrastination. Saner minds, please tell me if this makes sense before I waste my time implementing it:
In my previous entry I proposed using equation 16 and related as an approximate posterior of the hidden variables given the inputs. (Recall the posterior equates to pattern recognition, in as much as the hidden state encodes the meaning of a pattern, and is also needed for learning.)
The problem with that approximation is it's apt to credit some spurious combinations, and may be just downright wrong in ways and places due to normalization issues as well as information redundancies (since it's a DAG and not a tree). But it's hopefully a good start in the sense that it should give some notable weight to the regions (hidden variable state combinations) of interest.
So I gave some more thought to how exactly to use Rejection Sampling or whatnot to turn that estimate into a true posterior, and settled on whatnot:
Recall our input is `X`, and let's call the aggregate of all our hidden variables `Z` (so `Z -= A,B,C,D` in the figures).
Outlined in that entry is a strategy for drawing samples from an estimate of `"P"(Z|X)`. Let's call that estimate `"Q"(Z|X)` here. Note that the method of drawing the sample (top-down sampling from the distribution implied by equation 16) both provides us with a `Z` (specific states for `A...D`) and easily gives us `"Q"(Z|X)` for that `Z`, as the product of the probabilities of each state selection we make as we're building the sample. Furthermore, given a particular `Z` we can also easily compute the true `"P"(X,Z)`, which is the product of probabilities of those same state selections when made in the naive generative tree (i.e., when generating from scratch). Note that in general `"P"(X,Z)` will be tiny, while `"Q"(Z|X)` could potentially approach One in a well trained network (for `Z`s sampled from same). But what matters here are the relative values as we vary `Z`.
So my proposal is, for a given `X`, instead of sampling just one `Z` from the approximate `"Q"(Z|X)` as before, sample many such `Z`s, and then weight them by:
` "P"(Z|X) prop ["P"(X,Z)]/["Q"(Z|X)]` (1)
Then normalize that collection of weighted `Z`s to a net unity probability and call that our posterior. Note that instead of getting just mean field estimates for the hidden variables, this is actually a weighted collection of state combinations, so it preserves correlations that are lost when the variables are considered separately. This posterior can then be used for training and/or pattern recognition. (For training, the weight of each `Z` can just linearly scale the learning rate with which it is applied.)
The reasoning here is that dividing by `"Q"(Z|X)` cancels out the weight given each `Z` by having been sampled from same, giving us the equivalent of a uniform (by weight) sampling over `Z` that is concentrated (by time) near `"P"(Z|X)` (assuming `Q` is a good estimate). We then further re-weight that "uniform" sampling of `Z` to match the true distribution using `"P"(X,Z)` with `X` held constant, and then normalize to divide out the inferred `"P"(X)` resulting in `"P"(Z|X)`.
So the point is that `Q` gets us in the vicinity (generating good candidate `Z`s), and then we grope around by feel (because we can relatively weigh specific `Z`s easily with `"P"(X,Z)`).
Will it work?
[Update: Turns out this is called Importance Sampling.]
[<< | Prev | Index | Next | >>]