Unifying Divergence Minimization and Statistical Inference Via Convex Duality

In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process we prove that maximum a posteriori estimation arises as a special case of the dual of approximate maximum entropy estimation. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.
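The duality claim in the abstract admits a compact numerical illustration. The sketch below is not code from the paper; the feature map `phi`, the slack `eps`, and the toy data are illustrative assumptions. It fits a small discrete exponential family by maximizing the dual of a relaxed maximum entropy problem: with a quadratic penalty on the dual parameters, the dual objective is exactly L2-regularized maximum likelihood, i.e. MAP estimation under a Gaussian prior, and the fitted distribution approximately matches the empirical feature means.

```python
# Illustrative sketch only: dual of relaxed maximum entropy estimation = MAP estimation
# (Gaussian prior / L2-regularized maximum likelihood) for a toy exponential family.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

phi = rng.normal(size=(6, 3))           # feature map phi(x) for 6 discrete states
x_obs = rng.integers(0, 6, size=50)     # hypothetical observed sample
mu_hat = phi[x_obs].mean(axis=0)        # empirical feature means
eps = 0.1                               # slack of the relaxed moment-matching constraint


def log_partition(theta):
    return np.log(np.exp(phi @ theta).sum())


def neg_dual(theta):
    # Negative dual objective: -(<theta, mu_hat> - log Z(theta) - eps/2 * ||theta||^2).
    # The quadratic penalty corresponds to a Gaussian (MAP) prior on theta.
    return -(theta @ mu_hat - log_partition(theta) - 0.5 * eps * theta @ theta)


theta_map = minimize(neg_dual, np.zeros(3), method="L-BFGS-B").x

# Primal maxent solution recovered from the dual: p(x) proportional to exp(<theta, phi(x)>).
p = np.exp(phi @ theta_map - log_partition(theta_map))
print("dual / MAP parameters:", theta_map)
print("moment mismatch:", np.linalg.norm(phi.T @ p - mu_hat))  # small; shrinks as eps -> 0
```

The "algorithm by Zhang" mentioned in the abstract presumably refers to the sequential greedy approximation method of reference [14]; the off-the-shelf L-BFGS call above is used only to keep the sketch short.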

[1] A. N. Tikhonov et al. Solutions of Ill-Posed Problems, 1977.

[2] Kenneth O. Kortanek et al. Semi-Infinite Programming and Applications. ISMP, 1983.

[3] C. Atkinson. Methods for Solving Incorrectly Posed Problems, 1985.

[4] Bernhard E. Boser et al. A training algorithm for optimal margin classifiers. COLT, 1992.

[5] William Bialek et al. Statistics of Natural Images: Scaling in the Woods. NIPS, 1993.

[6] Radford M. Neal. Priors for Infinite Networks, 1996.

[7] Alexander J. Smola et al. Learning with Kernels, 1998.

[8] Stanley F. Chen et al. A Gaussian Prior for Smoothing Maximum Entropy Models, 1999.

[9] J. Lafferty. Additive models, boosting, and inference for generalized divergences. COLT, 1999.

[10] Manfred K. Warmuth et al. Boosting as entropy projection. COLT, 1999.

[11] Gunnar Rätsch et al. On the Convergence of Leveraging. NIPS, 2001.

[12] André Elisseeff et al. Stability and Generalization. Journal of Machine Learning Research, 2002.

[13] William Bialek et al. Occam factors and model-independent Bayesian learning of continuous distributions. Physical Review E, 2000.

[14] Tong Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 2003.

[15] Corinna Cortes et al. Support-Vector Networks. Machine Learning, 1995.

[16] Yoram Singer et al. Logistic Regression, AdaBoost and Bregman Distances. Machine Learning, 2000.

[17] Nello Cristianini et al. Kernel Methods for Pattern Analysis, 2004.

[18] Thomas Hofmann et al. Exponential Families for Conditional Random Fields. UAI, 2004.

[19] Miroslav Dudík et al. Performance Guarantees for Regularized Maximum Entropy Density Estimation. COLT, 2004.

[20] O. Bousquet. Theory of Classification: A Survey of Recent Advances, 2004.

[21] Alexander J. Smola et al. Heteroscedastic Gaussian process regression. ICML, 2005.

[22] J. Borwein et al. Techniques of Variational Analysis, 2005.

[23] S. Boucheron et al. Theory of classification: a survey of some recent advances, 2005.

[24] Michael P. Friedlander et al. On minimizing distortion and relative entropy. IEEE Transactions on Information Theory, 2006.

[25] Miroslav Dudík et al. Maximum Entropy Distribution Estimation with Generalized Regularization. COLT, 2006.

[26] Michael I. Jordan et al. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.