Estimation of Entropy and Mutual Information

We present some new results on the nonparametric estimation of entropy and mutual information. First, we use an exact local expansion of the entropy function to prove almost sure consistency and central limit theorems for three of the most commonly used discretized information estimators. The setup is related to Grenander's method of sieves and places no assumptions on the underlying probability measure generating the data. Second, we prove a converse to these consistency theorems, demonstrating that a misapplication of the most common estimation techniques leads to an arbitrarily poor estimate of the true information, even given unlimited data. This inconsistency theorem leads to an analytical approximation of the bias, valid in surprisingly small sample regimes and more accurate than the usual formula of Miller and Madow over a large region of parameter space. The two most practical implications of these results are negative: (1) information estimates in a certain data regime are likely contaminated by bias, even if bias-corrected estimators are used, and (2) confidence intervals calculated by standard techniques drastically underestimate the error of the most common estimation methods. Finally, we note a very useful connection between the bias of entropy estimators and a certain polynomial approximation problem. By casting bias calculation problems in this approximation theory framework, we obtain the best possible generalization of known asymptotic bias results. More interesting, this framework leads to an estimator with some nice properties: the estimator comes equipped with rigorous bounds on the maximum error over all possible underlying probability distributions, and this maximum error turns out to be surprisingly small. We demonstrate the application of this new estimator on both real and simulated data.
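
To make the estimators discussed above concrete, the sketch below implements the three discretized estimators most commonly meant in this context: the plug-in (maximum-likelihood) estimator, the Miller-Madow estimator, which adds the first-order bias correction (m_hat - 1) / (2N) to the plug-in estimate (m_hat being the number of occupied bins, N the sample size), and a jackknife-corrected plug-in estimator. This is a minimal illustrative sketch in Python assuming NumPy; the function names and the toy uniform spike-count data are ours, and the code is not taken from the paper.

    import numpy as np

    def plugin_entropy(counts):
        # Maximum-likelihood ("plug-in") entropy estimate, in nats.
        counts = np.asarray(counts, dtype=float)
        n = counts.sum()
        p = counts[counts > 0] / n
        return -np.sum(p * np.log(p))

    def miller_madow_entropy(counts):
        # Plug-in estimate plus the Miller-Madow correction (m_hat - 1) / (2 N),
        # where m_hat is the number of bins with at least one observation.
        counts = np.asarray(counts, dtype=float)
        n = counts.sum()
        m_hat = np.count_nonzero(counts)
        return plugin_entropy(counts) + (m_hat - 1.0) / (2.0 * n)

    def jackknife_entropy(samples, m):
        # Standard leave-one-out jackknife correction of the plug-in estimate:
        # N * H_full - (N - 1) * mean of the leave-one-out estimates.
        samples = np.asarray(samples)
        n = len(samples)
        counts = np.bincount(samples, minlength=m)
        h_full = plugin_entropy(counts)
        h_loo = 0.0
        for x in samples:
            counts[x] -= 1           # drop one observation from its bin
            h_loo += plugin_entropy(counts)
            counts[x] += 1           # restore it
        return n * h_full - (n - 1) * (h_loo / n)

    # Toy example (ours): N = 50 draws from a uniform distribution on m = 20 bins,
    # an undersampled regime where the plug-in estimate is biased downward.
    rng = np.random.default_rng(0)
    m, n = 20, 50
    samples = rng.integers(0, m, size=n)
    counts = np.bincount(samples, minlength=m)
    print("true entropy :", np.log(m))
    print("plug-in      :", plugin_entropy(counts))
    print("Miller-Madow :", miller_madow_entropy(counts))
    print("jackknife    :", jackknife_entropy(samples, m))

In this toy regime the plug-in estimate typically falls noticeably below the true value log(20), roughly by the (m - 1) / (2N) amount that the Miller-Madow correction restores; this is the kind of undersampling bias the abstract describes.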

[1]  C. E. Shannon  A mathematical theory of communication, 1948, Bell System Technical Journal.

[2]  H. Chernoff  A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations, 1952.

[3]  D. Darling On a Class of Problems Related to the Random Division of an Interval , 1953 .

[4]  G. A. Miller, et al.  Note on the bias of information estimates, 1955.

[5]  G. Basharin On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables , 1959 .

[6]  Michel Loève,et al.  Probability Theory I , 1977 .

[7]  P. Billingsley,et al.  Ergodic theory and information , 1966 .

[8]  Kazuoki Azuma WEIGHTED SUMS OF CERTAIN DEPENDENT RANDOM VARIABLES , 1967 .

[9]  A. Carlton On the bias of information estimates. , 1969 .

[10]  Vladimir Vapnik, Alexey Chervonenkis  On the uniform convergence of relative frequencies of events to their probabilities, 1971.

[12]  G. Watson Approximation theory and numerical methods , 1980 .

[13]  R. Serfling Approximation Theorems of Mathematical Statistics , 1980 .

[14]  B. Efron,et al.  The Jackknife Estimate of Variance , 1981 .

[15]  J. Steele An Efron-Stein inequality for nonsymmetric statistics , 1986 .

[16]  V. Totik,et al.  Moduli of smoothness , 1987 .

[17]  Colin McDiarmid,et al.  Surveys in Combinatorics, 1989: On the method of bounded differences , 1989 .

[18]  William Bialek,et al.  Reading a Neural Code , 1991, NIPS.

[19]  P. Bickel,et al.  Achieving Information Bounds in Non and Semiparametric Models , 1990 .

[20]  D. Donoho,et al.  Geometrizing Rates of Convergence, III , 1991 .

[21]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[22]  William H. Press, et al.  Numerical Recipes: The Art of Scientific Computing, Second Edition, 1998.

[23]  George G. Lorentz,et al.  Constructive Approximation , 1993, Grundlehren der mathematischen Wissenschaften.

[24]  David R. Wolf,et al.  Estimating functions of probability distributions from a finite set of samples. , 1994, Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics.

[25]  Marjorie G. Hahn,et al.  Limit theorems for the logarithm of sample spacings , 1995 .

[26]  Stefano Panzeri,et al.  The Upward Bias in Measures of Information Derived from Limited Data Samples , 1995, Neural Computation.

[27]  F. Y. Edgeworth,et al.  The theory of statistics , 1996 .

[28]  László Györfi,et al.  A Probabilistic Theory of Pattern Recognition , 1996, Stochastic Modelling and Applied Probability.

[29]  Jon A. Wellner,et al.  Weak Convergence and Empirical Processes: With Applications to Statistics , 1996 .

[30]  William Bialek,et al.  Spikes: Exploring the Neural Code , 1996 .

[31]  G. Lugosi,et al.  Consistency of Data-driven Histogram Methods for Density Estimation and Classification , 1996 .

[32]  Stefano Panzeri,et al.  Analytical estimates of limited sampling biases in different information measures. , 1996, Network.

[33]  William Bialek,et al.  Entropy and Information in Neural Spike Trains , 1996, cond-mat/9603127.

[34]  L. Györfi,et al.  Nonparametric entropy estimation. An overview , 1997 .

[35]  Ioannis Kontoyiannis Second-order noiseless source coding theorems , 1997, IEEE Trans. Inf. Theory.

[36]  Amir Dembo,et al.  Large Deviations Techniques and Applications , 1998 .

[37]  T. Albright,et al.  Efficient Discrimination of Temporal Patterns by Motion-Sensitive Neurons in Primate Visual Cortex , 1998, Neuron.

[38]  Igor Vajda,et al.  Estimation of the Information by an Adaptive Partitioning of the Observation Space , 1999, IEEE Trans. Inf. Theory.

[39]  Jonathan D. Victor,et al.  Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials , 2000, Neural Computation.

[40]  G. Jongbloed Minimax lower bounds and moduli of continuity , 2000 .

[41]  Naftali Tishby,et al.  The information bottleneck method , 2000, ArXiv.

[42]  Jonathan D. Victor, et al.  How the brain uses time to represent and process visual information, 2000, Brain Research.

[43]  A. Antos,et al.  Convergence properties of functional estimates for discrete distributions , 2001 .

[44]  Felipe Cucker,et al.  On the mathematical foundations of learning , 2001 .

[45]  William Bialek,et al.  Entropy and Inference, Revisited , 2001, NIPS.

[46]  B. Rao,et al.  Cramer-Rao type integral inequalities for general loss functions , 2001 .

[47]  Alexander G. Dimitrov,et al.  Information Distortion and Neural Coding , 2001 .

[50]  Alison L Gibbs,et al.  On Choosing and Bounding Probability Metrics , 2002, math/0209021.

[51]  J. Victor Binless strategies for estimation of information from neural data. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[52]  William H. Press,et al.  Numerical recipes in C , 2002 .

[53]  Liam Paninski,et al.  Noise-driven adaptation: in vitro and mathematical analysis , 2003, Neurocomputing.

[54]  A. Dimitrov,et al.  Analysis of neural coding through quantization with an information-based distortion measure , 2003, Network.

[55]  William Bialek,et al.  Entropy and information in neural spike trains: progress on the sampling problem. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[56]  Sanjeev R. Kulkarni,et al.  Universal entropy estimation via block sorting , 2004, IEEE Transactions on Information Theory.

[57]  José María Amigó,et al.  Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity , 2004, Neural Computation.

[58]  Liam Paninski,et al.  Estimating entropy on m bins given fewer than m samples , 2004, IEEE Transactions on Information Theory.

[59]  William Bialek,et al.  Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions , 2002, Neural Computation.

[60]  Shanbao Tong,et al.  Advances in quantitative electroencephalogram analysis methods. , 2004, Annual review of biomedical engineering.

[61]  Thomas Schürmann Bias analysis in entropy estimation , 2004 .