
MR/2: Multiscale Entropy and Applications

The term "entropy" is due to Clausius (1865), and the concept of entropy was introduced by Boltzmann into statistical mechanics, in order to measure the number of microscopic ways that a given macroscopic state can be realized. Shannon (1948) founded the mathematical theory of communication when he suggested that the information gained in a measurement depends on the number of possible outcomes out of which one is realized. Shannon also suggested that the entropy can be used for maximization of the bits transferred under a quality constraint. Jaynes (1957) proposed to use the entropy measure for radio interferometric image deconvolution, in order to select between a set of possible solutions that which contains the minimum of information, or following his entropy definition, that which has maximum entropy. In principle, the solution verifying such a condition should be the most reliable. Much work has been carried out in the last 30 years on the use of entropy for the general problem of data filtering and deconvolution. Traditionally information and entropy are determined from events and the probability of their occurrence. Signal and noise are basic building-blocks of signal and data analysis in the physical and communication sciences. Instead of the probability of an event, we are led to consider the probabilities of our data being either signal or noise. Consider any data signal with interpretative value. Now consider a uniform "scrambling" of the same data signal. (Starck et al., 1998, illustrate this with the widely-used Lena test image.) Any traditional definition of entropy, the main idea of which is to establish a relation between the received information and the probability of the observed event, would give the same entropy for these two cases. A good definition of entropy should instead satisfy the following criteria:
  1. The information in a flat signal is zero.
  2. The amount of information in a signal is independent of the background.
  3. The amount of information depends on the noise. A given signal Y (Y = X + Noise) does not furnish the same information if the noise level is high or low.
  4. The entropy must work in the same way for a signal value of B + epsilon (B being the background) as for a signal value of B - epsilon.
  5. The amount of information depends on the correlation in the signal. If a signal S presents large features above the noise, it contains a lot of information. If we generate a new dataset by randomly permuting the values of S, the large features will evidently disappear, and this new signal will contain less information, even though the data values are identical to those in S.
To cater for the background, we introduce the concept of multiresolution into our entropy. We consider that the information contained in a dataset is the sum of the information at the different resolution levels, j. A wavelet transform is one choice for such a multiscale decomposition of the data. We define the information of a wavelet coefficient w_j(k) at position k and scale j as I(w_j(k)) = - ln p(w_j(k)), where p is the probability of the wavelet coefficient. The entropy, commonly denoted H, is then the sum of I(w_j(k)) over all positions k and all scales j. For Gaussian noise we continue in this direction, using Gaussian probability distributions, and find that the entropy becomes H = sum over j and k of w_j(k)^2 / (2 sigma_j^2), i.e. each coefficient squared, divided by twice the noise variance at that scale, where sigma_j is the (Gaussian) noise standard deviation at scale j.

We see that the information is proportional to the energy of the wavelet coefficients: the higher a wavelet coefficient, the lower its probability, and the higher the information it furnishes. Our entropy definition is thus completely dependent on the noise modeling. If we consider a signal S and assume Gaussian noise with standard deviation sigma, we will not measure the same information as when we assume a different standard deviation, or a noise following another distribution.

Returning to our example of a signal of substantive value, and a scrambled version of it, we can plot an information versus scale curve (e.g. log(entropy) at each scale, using the above definition, against the multiresolution scale). For the scrambled signal the curve is flat, while for the original signal it increases with scale. Such an entropy versus scale plot can be used to investigate differences between encrypted and unencrypted signals, to study typical versus atypical cases, and to single out atypical or interesting signals.
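As a minimal illustration of this definition under the Gaussian noise model, the following Python sketch computes a per-scale entropy for a 1D signal and for a scrambled copy of it. It is not code from the MR/2 package: the "a trous" B3-spline transform, the function names (atrous_1d, noise_std_per_scale, multiscale_entropy) and the Monte Carlo calibration of the per-scale noise standard deviations sigma_j are assumptions made purely for illustration.

    import numpy as np

    def atrous_1d(signal, n_scales=4):
        """'A trous' wavelet transform with a B3-spline kernel (illustrative sketch).
        Returns the list of detail (wavelet) coefficient arrays w_1, ..., w_J."""
        h = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
        c = np.asarray(signal, dtype=float)
        details = []
        for j in range(n_scales):
            step = 2 ** j                                  # dilate the kernel at scale j
            kernel = np.zeros((len(h) - 1) * step + 1)
            kernel[::step] = h
            c_next = np.convolve(c, kernel, mode="same")   # smoothed approximation
            details.append(c - c_next)                     # wavelet coefficients w_j(k)
            c = c_next
        return details

    def noise_std_per_scale(n_scales, n=100_000, seed=0):
        """Estimate, by Monte Carlo, the standard deviation of unit Gaussian noise
        at each wavelet scale; sigma_j = sigma * (this factor)."""
        rng = np.random.default_rng(seed)
        return [w.std() for w in atrous_1d(rng.standard_normal(n), n_scales)]

    def multiscale_entropy(signal, sigma, n_scales=4):
        """H = sum_j sum_k w_j(k)^2 / (2 sigma_j^2) for Gaussian noise of std sigma.
        Returns the per-scale terms and their total."""
        details = atrous_1d(signal, n_scales)
        unit = noise_std_per_scale(n_scales)
        per_scale = [float(np.sum(w ** 2) / (2.0 * (sigma * s) ** 2))
                     for w, s in zip(details, unit)]
        return per_scale, sum(per_scale)

    # A smooth signal versus a scrambled (randomly permuted) copy of it.
    rng = np.random.default_rng(1)
    t = np.linspace(0.0, 1.0, 1024)
    signal = np.sin(2 * np.pi * 3 * t) + 0.1 * rng.standard_normal(t.size)
    scrambled = rng.permutation(signal)
    print("original :", np.round(multiscale_entropy(signal, sigma=0.1)[0], 1))
    print("scrambled:", np.round(multiscale_entropy(scrambled, sigma=0.1)[0], 1))

With these assumptions, the per-scale entropy of the original signal grows towards the coarse scales, while that of the scrambled copy stays roughly flat, matching the behaviour described above; the zero-padded boundaries of np.convolve and the chosen sigma are deliberately simplistic.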

To read further:

  1. J.L. Starck and F. Murtagh, "Astronomical Image and Signal Processing", IEEE Signal Processing Magazine, 18, 2, pp. 30--40, 2001.
  2. J.L. Starck, F. Murtagh, P. Querre and F. Bonnarel, "Entropy and Astronomical Data Analysis: Perspectives from Multiresolution Analysis", Astron. and Astrophys., 368, pp. 730--746, 2001.
  3. J.L. Starck and F. Murtagh, "Multiscale Entropy Filtering", Signal Processing, 76, 2, pp. 147--165, 1999.
  4. J.L. Starck, F. Murtagh and R. Gastaud, "A New Entropy Measure Based on the Wavelet Transform and Noise Modeling", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 45, pp. 1118--1124, 1998.
Applications in MR/2 are:
  • Multiscale entropy calculation.
  • 1D and 2D data filtering using the multiscale entropy.
  • Image deconvolution.
  • Image background fluctuation.