Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms

This paper studies Fenchel-Young losses, a generic way to construct convex loss functions from a regularization function. We analyze their properties in depth, showing that they unify many well-known loss functions and make it easy to create useful new ones. Fenchel-Young losses constructed from a generalized entropy, including the Shannon and Tsallis entropies, induce predictive probability distributions. We give conditions under which a generalized entropy yields losses with a separation margin and probability distributions with sparse support. Finally, we derive efficient algorithms, making Fenchel-Young losses appealing both in theory and in practice.
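For concreteness, the Fenchel-Young loss generated by a regularizer Ω is L_Ω(θ; y) = Ω*(θ) + Ω(y) − ⟨θ, y⟩, where Ω* is the convex conjugate of Ω. The NumPy sketch below illustrates two instances over the probability simplex: the negative Shannon entropy, which recovers the logistic (softmax cross-entropy) loss, and the negative Tsallis entropy with α = 2, which recovers the sparsemax loss. It is a minimal illustration assuming one-hot targets, with function names of our own choosing; it is not the paper's reference implementation.

```python
import numpy as np

def logistic_fy_loss(theta, y):
    """Fenchel-Young loss for Omega = negative Shannon entropy on the simplex.
    Omega*(theta) = logsumexp(theta) and Omega(e_y) = 0 for one-hot e_y,
    so the loss reduces to the logistic (softmax cross-entropy) loss."""
    m = theta.max()  # shift for numerical stability
    return m + np.log(np.exp(theta - m).sum()) - theta[y]

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex;
    this argmax defines Omega* in the alpha = 2 (Tsallis) case."""
    z = np.sort(theta)[::-1]                    # sort in decreasing order
    cssmz = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssmz / k > 0                 # coordinates kept in the support
    tau = cssmz[support][-1] / k[support][-1]   # threshold
    return np.maximum(theta - tau, 0.0)

def sparsemax_fy_loss(theta, y):
    """Fenchel-Young loss for Omega(p) = (||p||^2 - 1) / 2 on the simplex,
    i.e. the negative Tsallis entropy with alpha = 2: the sparsemax loss."""
    p = sparsemax(theta)
    omega_star = p @ theta - 0.5 * (p @ p - 1.0)  # Omega*(theta), via Danskin
    return omega_star - theta[y]                  # Omega(e_y) = 0

theta = np.array([1.2, 0.3, -0.8])
print(logistic_fy_loss(theta, y=0))   # ~0.433
print(sparsemax_fy_loss(theta, y=0))  # ~0.0025; sparsemax(theta) has sparse support
```

On this example, sparsemax(theta) assigns zero probability to the third class, illustrating the sparse-support behavior the abstract describes, while the Shannon case always produces a dense softmax distribution.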
