Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization

We develop and analyze M-estimation methods for divergence functionals and the likelihood ratios of two probability distributions. Our method is based on a nonasymptotic variational characterization of f-divergences, which allows the problem of estimating divergences to be tackled via convex empirical risk optimization. The resulting estimators are simple to implement, requiring only the solution of standard convex programs. We present an analysis of consistency and convergence for these estimators. Given conditions only on the ratios of densities, we show that our estimators can achieve optimal minimax rates for the likelihood ratio and the divergence functionals in certain regimes. We derive an efficient optimization algorithm for computing our estimates, and illustrate their convergence behavior and practical viability by simulations.
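
To make the variational idea concrete, the following is a minimal sketch for the Kullback-Leibler case, not the authors' algorithm. It relies on the standard lower bound KL(P || Q) >= E_P[log g] - E_Q[g] + 1, which holds for any positive function g and is tight at g = p/q, so maximizing the empirical version over a function class estimates the divergence and the likelihood ratio together. Everything specific here is an assumption made for illustration: the two Gaussian samples, the two-parameter class log g(x) = a + b*x, the damped Newton solver, and the names `objective` and `estimate_kl`.

```python
import math
import random


def objective(theta, xs, ys):
    """Empirical lower bound on KL(P || Q) for log g(x) = a + b * x.

    xs are samples from P, ys are samples from Q.
    """
    a, b = theta
    mean_log_g = sum(a + b * x for x in xs) / len(xs)
    mean_g = sum(math.exp(a + b * y) for y in ys) / len(ys)
    return mean_log_g - mean_g + 1.0


def estimate_kl(xs, ys, iters=50, tol=1e-10):
    """Maximize the concave objective with damped Newton steps (illustrative)."""
    a, b = 0.0, 0.0
    mean_x = sum(xs) / len(xs)
    n = len(ys)
    for _ in range(iters):
        # Weights g(y_j) under the current parameters.
        w = [math.exp(a + b * y) for y in ys]
        # Negative Hessian: weighted second-moment matrix of (1, y).
        m00 = sum(w) / n
        m01 = sum(wi * y for wi, y in zip(w, ys)) / n
        m11 = sum(wi * y * y for wi, y in zip(w, ys)) / n
        # Gradient of the objective with respect to (a, b).
        ga = 1.0 - m00
        gb = mean_x - m01
        if abs(ga) + abs(gb) < tol:
            break
        # Newton direction from the 2x2 linear system, solved in closed form.
        det = m00 * m11 - m01 * m01
        da = (m11 * ga - m01 * gb) / det
        db = (m00 * gb - m01 * ga) / det
        # Backtracking: halve the step until the objective does not decrease.
        current = objective((a, b), xs, ys)
        t = 1.0
        while t > 1e-8 and objective((a + t * da, b + t * db), xs, ys) < current:
            t *= 0.5
        a, b = a + t * da, b + t * db
    return objective((a, b), xs, ys), (a, b)


# Synthetic data (assumed for the example): P = N(1, 1), Q = N(0, 1).
# The true log ratio is x - 0.5 and the true KL(P || Q) is 0.5.
rng = random.Random(0)
xs = [rng.gauss(1.0, 1.0) for _ in range(5000)]
ys = [rng.gauss(0.0, 1.0) for _ in range(5000)]

kl_hat, (a_hat, b_hat) = estimate_kl(xs, ys)
print("estimated KL:", kl_hat)
print("estimated log-ratio: %.3f + %.3f * x" % (a_hat, b_hat))
```

In this toy setting the true log ratio lies inside the chosen class, so the fitted (a, b) should approach (-0.5, 1) and the objective value should approach 0.5 as the sample sizes grow. When the class does not contain the true ratio, the same objective still yields a lower bound on the divergence rather than a consistent estimate.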
