Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms

This paper studies Fenchel-Young losses, a generic way to construct convex loss functions from a regularization function. We analyze their properties in depth, showing that they unify many well-known loss functions and make it easy to create useful new ones. Fenchel-Young losses constructed from a generalized entropy, including the Shannon and Tsallis entropies, induce predictive probability distributions. We give conditions under which a generalized entropy yields losses with a separation margin and probability distributions with sparse support. Finally, we derive efficient algorithms, making Fenchel-Young losses appealing both in theory and in practice.
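For concreteness, the Fenchel-Young loss generated by a regularizer Ω is L_Ω(θ; y) = Ω*(θ) + Ω(y) − ⟨θ, y⟩, where Ω* is the convex conjugate of Ω. The NumPy sketch below illustrates two instances over the probability simplex: the negative Shannon entropy, which recovers the logistic (softmax cross-entropy) loss, and the negative Tsallis entropy with α = 2, which recovers the sparsemax loss. It is a minimal illustration assuming one-hot targets, with function names of our own choosing; it is not the paper's reference implementation.

```python
import numpy as np

def logistic_fy_loss(theta, y):
    """Fenchel-Young loss for Omega = negative Shannon entropy on the simplex.
    Omega*(theta) = logsumexp(theta) and Omega(e_y) = 0 for one-hot e_y,
    so the loss reduces to the logistic (softmax cross-entropy) loss."""
    m = theta.max()  # shift for numerical stability
    return m + np.log(np.exp(theta - m).sum()) - theta[y]

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex;
    this argmax defines Omega* in the alpha = 2 (Tsallis) case."""
    z = np.sort(theta)[::-1]                    # sort in decreasing order
    cssmz = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssmz / k > 0                 # coordinates kept in the support
    tau = cssmz[support][-1] / k[support][-1]   # threshold
    return np.maximum(theta - tau, 0.0)

def sparsemax_fy_loss(theta, y):
    """Fenchel-Young loss for Omega(p) = (||p||^2 - 1) / 2 on the simplex,
    i.e. the negative Tsallis entropy with alpha = 2: the sparsemax loss."""
    p = sparsemax(theta)
    omega_star = p @ theta - 0.5 * (p @ p - 1.0)  # Omega*(theta), via Danskin
    return omega_star - theta[y]                  # Omega(e_y) = 0

theta = np.array([1.2, 0.3, -0.8])
print(logistic_fy_loss(theta, y=0))   # ~0.433
print(sparsemax_fy_loss(theta, y=0))  # ~0.0025; sparsemax(theta) has sparse support
```

On this example, sparsemax(theta) assigns zero probability to the third class, illustrating the sparse-support behavior the abstract describes, while the Shannon case always produces a dense softmax distribution.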
