Kullback-Leibler divergence estimation of continuous distributions

We present a method for estimating the Kullback-Leibler divergence between continuous densities and prove that it converges almost surely. Divergence estimation is typically approached by estimating the densities first. Our main result shows that this intermediate step is unnecessary: the divergence can be estimated either from the empirical cdf or from k-nearest-neighbour density estimates, even though the latter do not converge to the true measure for finite k. The convergence proof rests on describing the statistics of our estimator through waiting-time distributions, such as the exponential or the Erlang. We illustrate the proposed estimators, compare them to existing methods based on density estimation, and outline how our divergence estimators can be used to solve the two-sample problem.
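As a rough illustration of the k-nearest-neighbour route mentioned above, the sketch below implements a generic kNN-based divergence estimator of the form D ≈ (d/n) Σ_i log(s_k(x_i)/r_k(x_i)) + log(m/(n−1)), where r_k(x_i) is the distance from x_i to its k-th nearest neighbour within the sample from p (excluding x_i itself) and s_k(x_i) the corresponding distance within the sample from q. This is a minimal sketch of that well-known estimator family, not the paper's exact construction; the function name `knn_kl_divergence`, the use of SciPy's kd-tree, and the default k=1 are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact estimator) of a
# k-nearest-neighbour KL divergence estimator for samples x ~ p and y ~ q in R^d.
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(x, y, k=1):
    """Estimate D(p || q) in nats from samples x ~ p (n, d) and y ~ q (m, d)."""
    x = np.atleast_2d(x)
    y = np.atleast_2d(y)
    n, d = x.shape
    m, _ = y.shape

    # k-NN distance of each x_i among the other x's: query k+1 neighbours
    # because the nearest point in the same sample is x_i itself.
    r_k = cKDTree(x).query(x, k=k + 1)[0][:, -1]

    # k-NN distance of each x_i among the y's.
    s_k = cKDTree(y).query(x, k=k)[0]
    if k > 1:
        s_k = s_k[:, -1]

    # Divergence estimate; assumes no duplicate points (r_k > 0).
    return (d / n) * np.sum(np.log(s_k / r_k)) + np.log(m / (n - 1))
```

A quick sanity check is to draw samples from two univariate Gaussians and compare the estimate with the closed-form KL divergence between them; as n and m grow, the kNN estimate should approach the analytic value even for fixed, small k.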
