
Tuesday, December 14, 2004

Image Wavelets: Summed or Multiplied? (technical)



Email excerpt, probably only of potential interest to those interested in machine vision or image processing. I'm curious to hear anyone's thoughts on the matter.

> The reason for the particular choices in jpg 
> as well as mp3 compression are based on models of visual perception 
> while maximizing compression ratios. I am sure if you make some other 
> choices for the parameters you can get better compression, but the 
> visual quality of the results will suffer too.

JPG uses a DCT in order to sparsify and organize the energy in the image by frequency, gaining compression from the sparseness and from quantizing the high-frequency information more heavily than the low-frequency information based on analogous deficits in human perception.
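A minimal sketch of that idea on a single 8-sample row, assuming an unnormalized DCT-II and a made-up quantization table whose step grows with frequency. Real JPEG works on 8x8 blocks in two dimensions with its own quantization tables; the sample values in `ROW` and the table `STEPS` here are purely illustrative.

```python
import math

def dct(block):
    """Unnormalized DCT-II of a short list of samples."""
    n = len(block)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() above."""
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(1, n)))
            * 2 / n
            for i in range(n)]

ROW = [52, 55, 61, 66, 70, 61, 64, 73]    # hypothetical pixel values
STEPS = [4 * (k + 1) for k in range(8)]   # hypothetical: coarser steps at higher frequency

coeffs = dct(ROW)
# Small coefficients round to 0, which is where the compression comes from.
quantized = [round(c / s) for c, s in zip(coeffs, STEPS)]
restored = idct([q * s for q, s in zip(quantized, STEPS)])

print(quantized)
print([round(v, 1) for v in restored])
```

Comparing `restored` with `ROW` shows the error introduced by the quantization step alone, since `idct(dct(ROW))` reproduces `ROW` up to rounding.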

The color space JPG uses is YUV (strictly speaking YCbCr, a scaled and offset variant), defined by the particular non-linear behavior of CRT electron-beams vs. driving voltage, but specifically defined such that Y (which they call the luminance) is:

Y = 0.299R + 0.587G + 0.114B

In fact, this Y is the "luma", not the luminance, and it is not a mathematically sound quantity: the input components R, G, and B are assumed to be gamma corrected (with, I believe, a gamma of 1/0.45 [from Rec. 709], though I have not yet found this specified anywhere). The luma is a linear sum of non-linear quantities and does not correspond well with human perception. It is used because of the legacy connection to CRT electron beams (from which the 2.2 gamma arises): it was easy to combine the color gun signals via simple weighted sums to get an approximately OK black-and-white signal.

In short, when they were creating JPG, they started with what they had on hand.

The official CIE human-perception based luminance is:

Y = 0.2125R + 0.7154G + 0.0721B

Where here all values are in _linear_ space. This is not what JPEG uses.
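A minimal sketch of the difference between the two quantities, assuming a pure power-law gamma of 1/0.45 (the actual Rec. 709 curve also has a short linear segment near black, ignored here). The function names are made up for illustration, and the two weight sets also assume different primaries, so the comparison is only indicative.

```python
GAMMA = 1 / 0.45   # assumed pure power law

def decode(v):
    """Gamma-encoded value in [0, 1] -> linear light."""
    return v ** GAMMA

def encode(x):
    """Linear light in [0, 1] -> gamma-encoded value."""
    return x ** (1 / GAMMA)

def luma(r, g, b):
    """JPEG-style Y: weighted sum of the gamma-encoded components."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def luminance(r, g, b):
    """Linear-light Y: decode each component first, then take the weighted sum."""
    return 0.2125 * decode(r) + 0.7154 * decode(g) + 0.0721 * decode(b)

# Inputs are gamma-encoded R, G, B triples. The luminance is re-encoded so
# both numbers live on the same (gamma-encoded) scale.
for name, rgb in [("gray", (0.5, 0.5, 0.5)),
                  ("green", (0.0, 1.0, 0.0)),
                  ("blue", (0.0, 0.0, 1.0))]:
    print(name, round(luma(*rgb), 3), round(encode(luminance(*rgb)), 3))
```

The two agree for neutral grays, where all three components are equal, and drift apart for saturated colors, where the order of "weight" and "gamma" matters.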

So, JPEG is doing DCTs on a non-linear space based on a power law derived from cathode ray tube behavior. It has generally been observed with much glee that this curve also looks something like the curve of human perception of lightness, but the coincidence is only qualitative and approximate.

Now, if we assume reality presents with a compact set of causes for any given image (e.g., for any given patch: light source, texture, color, surface orientation), we can ask: what transform will most sparsify this energy, i.e., separate the sources? A DCT on the linear data, which is what I suspect most people assume JPEG is effectively doing, doesn't really make sense, because the sources in reality are being multiplied together, not added, and so the linear DCT won't separate them (well). What would, I am proposing, is a DCT of the log-luminance, corresponding to the idea that the sources are multiplicatively combined.
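A toy illustration of the argument, under a deliberately favorable assumption: two hypothetical "causes" (a slow shading term and a faster texture term) that are each a single cosine in the log domain and combine by multiplication. The signal is constructed so that the log-domain DCT is as sparse as possible; real images will be far less clean.

```python
import math

def dct(signal):
    """Unnormalized DCT-II, O(N^2); fine for a short demo."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
            for k in range(n)]

def significant(coeffs, fraction=0.01):
    """Count coefficients whose magnitude exceeds `fraction` of the largest one."""
    peak = max(abs(c) for c in coeffs)
    return sum(1 for c in coeffs if abs(c) > fraction * peak)

N = 64
# Hypothetical causes: each is exp(amplitude * one DCT basis function).
shading = [math.exp(1.5 * math.cos(math.pi * (i + 0.5) * 1 / N)) for i in range(N)]
texture = [math.exp(1.0 * math.cos(math.pi * (i + 0.5) * 12 / N)) for i in range(N)]
image = [s * t for s, t in zip(shading, texture)]   # causes multiply

linear_count = significant(dct(image))
log_count = significant(dct([math.log(v) for v in image]))
print("significant coefficients, linear:", linear_count)
print("significant coefficients, log:   ", log_count)
```

In the log domain the product becomes a sum of two basis functions, so only two coefficients survive; in the linear domain each cause spreads over its harmonics and the product adds sidebands around them.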

So, my recent observation was that JPEG sort of does this without explicitly realizing it (my conjecture is that it is by lucky/empirical coincidence and not by design), in that the gamma encoding, luminance^0.45 (the inverse of the 2.222 display gamma), is a rough approximation to log(luminance), and quite specifically to log(luminance^(1/3)), which is log(lightness) -- i.e., where the log base is chosen to match well the range of human vision.
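One way to eyeball how rough the approximation is: compare a pure power-law encoding (exponent 0.45) with a scaled logarithm. The scaling is an assumption of this sketch, not something from the email: the log curve is pinned to the gamma curve at white (x = 1) and at a hypothetical 18% gray anchor, and other anchor choices would move the numbers around.

```python
import math

def gamma_encode(x):
    """Pure power-law encoding with exponent 0.45 (ignores any linear toe)."""
    return x ** 0.45

# Hypothetical anchoring: both curves agree at x = 1 and at x = ANCHOR.
ANCHOR = 0.18
SLOPE = (gamma_encode(ANCHOR) - 1.0) / math.log(ANCHOR)

def log_encode(x):
    """Scaled logarithm passing through the two anchor points."""
    return 1.0 + SLOPE * math.log(x)

def gap(x):
    """Absolute difference between the two encodings at linear value x."""
    return abs(gamma_encode(x) - log_encode(x))

for x in (0.01, 0.05, 0.18, 0.5, 1.0):
    print(x, round(gamma_encode(x), 3), round(log_encode(x), 3), round(gap(x), 3))
```

With this anchoring the two curves stay close through the midtones and the gap grows toward black. That says nothing on its own about which curve compresses better.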

If you look at JPEGs, they suck first in darker regions. If you look at the two curves, that's where JPEG's gamma curve deviates most notably from the log(lightness), which I claim might have been a better choice. So, I'm not so sure a new choice of parameters wouldn't result in a better compression/quality ratio. That's not my point or goal here, though, since I don't really care about JPG compression; I'm more interested in it from the standpoint of this question:

Does this all imply, or provide evidence for, the notion that images are better treated as being composed of products of wavelets rather than sums (or equivalently that linear feature-extracting algorithms would do better to start with log(luminance) values)?

Here's a question for you: do you know of any algorithm (or ever heard of any papers that address this) which will find a best-fit non-linear pre-processing curve which minimizes the entropy of the resulting DFT (or DCT) coefficients?
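For concreteness, here is a naive brute-force version of what such a search could look like. It is not a known answer to the question, just a sketch under loud assumptions: the candidate curves are a hand-picked set (a few power laws plus a log), the input is a synthetic positive, non-constant 1-D signal, and each mapped signal is rescaled to [0, 1] before uniform quantization, which is a crude stand-in for a proper rate/distortion comparison. All names are hypothetical.

```python
import math
from collections import Counter

def dct(signal):
    """Unnormalized DCT-II, O(N^2)."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
            for k in range(n)]

def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of a list of hashable symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def coefficient_entropy(signal, curve, step=0.05):
    """Entropy of the quantized DCT coefficients after applying `curve`."""
    mapped = [curve(v) for v in signal]
    lo, hi = min(mapped), max(mapped)
    scaled = [(m - lo) / (hi - lo) for m in mapped]   # assumes a non-constant signal
    n = len(scaled)
    return entropy_bits([round(c / n / step) for c in dct(scaled)])

def best_curve(signal, exponents):
    """Score each candidate curve and return the lowest-entropy one."""
    candidates = {f"x^{p:g}": (lambda v, p=p: v ** p) for p in exponents}
    candidates["log(x)"] = math.log
    scores = {name: coefficient_entropy(signal, f) for name, f in candidates.items()}
    return min(scores, key=scores.get), scores

N = 64
# Synthetic multiplicative signal: product of two exponentiated cosines.
signal = [math.exp(1.5 * math.cos(math.pi * (i + 0.5) / N)
                   + math.cos(math.pi * (i + 0.5) * 12 / N)) for i in range(N)]
name, scores = best_curve(signal, exponents=[1.0, 0.45, 1 / 3])
print(name, scores)
```

A real answer would presumably fit a continuous family of curves over many images and hold perceived distortion fixed, which this sketch does not attempt.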

> A lot of people swear by the CIELab color space
> since it is more perceptually meaningful. Gamma correction by itself is 
> not particularly useful, since it is a monotonic transform, one that 
> while being useful for perceptual purposes may or may not be useful for 
> machine vision purposes. It cannot create information.

But it does dramatically change the relationship between real-world image-value causes and the distribution of the resulting coefficients from a DCT (or, say, any linearly based machine-vision algorithm)...

CIE-Lab is again based on a power function, with a gamma of 3, I think. The choice of using power instead of exponent is probably still legacy from the CRT era. I had the impression the CIE empirical research indicated human vision is logarithmic, with a threshold of perception equal to about 1% of the luminance. So Lab is an OK approximation but may still suck on the low end just like YUV.
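A small sketch comparing the two models, using the standard CIE 1976 L* formula (cube root with a linear segment near black) and a pure 1%-threshold logarithmic model. The logarithmic model, the `y_min` floor, and the function names are assumptions made for illustration, taken from the impression stated above rather than from any CIE document.

```python
import math

def cie_lightness(y):
    """CIE 1976 L* for relative luminance y = Y/Yn in [0, 1]."""
    if y > (6 / 29) ** 3:
        return 116 * y ** (1 / 3) - 16
    return (29 / 3) ** 3 * y   # linear segment near black

def weber_steps(y, y_min=0.001, threshold=0.01):
    """Number of just-noticeable steps from y_min up to y, if each step is a 1% increase."""
    return math.log(y / y_min) / math.log(1 + threshold)

# Each pair is one doubling of luminance, once in the dark and once in the bright.
for lo, hi in [(0.01, 0.02), (0.5, 1.0)]:
    l_span = cie_lightness(hi) - cie_lightness(lo)
    step_span = weber_steps(hi) - weber_steps(lo)
    print(lo, hi, round(l_span, 2), round(step_span, 1))
```

Under the logarithmic model every doubling contains the same number of steps, while L* gives the dark doubling a smaller share of its range than the bright one. Whether that matters in practice depends on how well the 1% model holds at low luminance.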





Simon Funk / simonfunk@gmail.com