Consistency and Generalization Bounds for Maximum Entropy Density Estimation

We investigate the statistical properties of maximum entropy density estimation in both the complete-data and the incomplete-data case. We show that, under certain assumptions, the generalization error can be bounded in terms of the complexity of the underlying feature functions. This bound allows us to establish the universal consistency of maximum entropy density estimation.
