A Class of Parameterized Loss Functions for Classification: Optimization Tradeoffs and Robustness Characteristics

Recently, a parameterized class of loss functions called $\alpha$-loss, $\alpha \in [1,\infty]$, has been introduced for classification. This family, which includes the log-loss and the 0-1 loss as special cases, comes with compelling properties, including an equivalent margin-based form that is classification-calibrated for all $\alpha$. We introduce a generalization of this family to the entire range $\alpha \in (0,\infty]$ and establish how the parameter $\alpha$ enables the practitioner to choose among a host of operating conditions that are important in modern machine learning tasks. We prove that smaller values of $\alpha$ are more conducive to fast optimization: indeed, $\alpha$-loss is convex for $\alpha \le 1$ and quasi-convex for $\alpha > 1$. Moreover, we establish bounds quantifying the degradation of local quasi-convexity of the optimization landscape as $\alpha$ increases, and we show that this translates directly into a computational slowdown. On the other hand, our theoretical results also suggest that larger values of $\alpha$ lead to better generalization performance. This is a consequence of the ability of $\alpha$-loss to limit the effect of less likely data as $\alpha$ increases from 1, thereby conferring robustness to outliers and noise in the training data. We support this assertion with several experiments on benchmark datasets that demonstrate the efficacy of $\alpha$-loss with $\alpha > 1$ in providing robustness to errors in the training data. Of equal interest, for $\alpha < 1$, our experiments show that the decreased robustness, i.e., the increased sensitivity to less likely data, appears to counteract class imbalances in the training data.
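
For concreteness, here is a brief sketch of the loss in one standard form from the $\alpha$-loss literature (the symbol $p \in (0,1]$, denoting the model's predicted probability of the true label, is our notational assumption):

$$\ell_\alpha(p) \;=\; \frac{\alpha}{\alpha-1}\left(1 - p^{\frac{\alpha-1}{\alpha}}\right), \qquad \alpha \in (0,1) \cup (1,\infty),$$

with the cases $\alpha = 1$ and $\alpha = \infty$ defined by continuous extension: as $\alpha \to 1$ the expression tends to the log-loss $-\log p$, and as $\alpha \to \infty$ it tends to $1 - p$, a soft version of the 0-1 loss. In particular, for $\alpha > 1$ the loss remains bounded by $\frac{\alpha}{\alpha-1}$ even as $p \to 0$, which tempers the influence of unlikely or mislabeled examples, whereas for $\alpha < 1$ the penalty on such examples grows faster than the log-loss, consistent with the optimization and robustness tradeoffs described above.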
