Generalized Maximum Entropy for Supervised Classification

The maximum entropy principle advocates evaluating the probabilities of events using the distribution that maximizes entropy among those that satisfy certain expectation constraints. This principle can be generalized to arbitrary decision problems, where it corresponds to minimax approaches. This paper establishes a framework for supervised classification based on the generalized maximum entropy principle that leads to minimax risk classifiers (MRCs). We develop learning techniques that determine MRCs for general entropy functions and provide performance guarantees by means of convex optimization. In addition, we describe the relationship between the presented techniques and existing classification methods, and we quantify the performance of MRCs in comparison with the proposed bounds and with conventional methods.
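
As a rough illustration of the two principles named above, the classical and generalized formulations can be written as follows. The notation is an assumption made here for illustration and is not taken from the paper: $\Phi$ is a feature map, $\tau$ a vector of expectation estimates, $\mathcal{U}$ the set of distributions satisfying the constraints, and $\ell(q, z)$ the loss of action $q$ on outcome $z$. The final equality is the standard minimax correspondence and holds only under suitable regularity conditions (for instance, a convex uncertainty set).

```latex
% Classical maximum entropy: pick the most uncertain distribution
% that is consistent with the expectation constraints.
\[
  p^{*} \in \arg\max_{p \in \mathcal{U}} H(p),
  \qquad
  H(p) = -\sum_{z} p(z)\log p(z),
  \qquad
  \mathcal{U} = \{\, p : \mathbb{E}_{p}[\Phi(z)] = \tau \,\}.
\]

% Generalized entropy for a loss \ell: the smallest expected loss
% achievable when the distribution p is known.
\[
  H_{\ell}(p) = \inf_{q}\, \mathbb{E}_{p}\,[\ell(q, z)].
\]

% Maximizing the generalized entropy over \mathcal{U} then matches
% the minimax risk (under regularity conditions).
\[
  \max_{p \in \mathcal{U}} H_{\ell}(p)
  = \max_{p \in \mathcal{U}} \inf_{q}\, \mathbb{E}_{p}\,[\ell(q, z)]
  = \inf_{q} \max_{p \in \mathcal{U}} \mathbb{E}_{p}\,[\ell(q, z)].
\]
```

With the logarithmic loss $\ell(q, z) = -\log q(z)$, the generalized entropy $H_{\ell}$ reduces to the Shannon entropy $H$, so the classical principle is recovered as a special case.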
