Tilted Empirical Risk Minimization

Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work we explore them through a unified framework: tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM, using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: we show that TERM can increase the influence of outliers to enforce fairness or decrease it to promote robustness; that it has variance-reduction properties that can benefit generalization; and that it can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be solved efficiently relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.
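To make the tilt concrete: for per-example losses l_i(theta) and tilt t != 0, the tilted objective replaces the plain average of ERM with the log-sum-exp aggregate (1/t) log((1/N) sum_i exp(t * l_i(theta))); the limit t -> 0 recovers ERM, t > 0 emphasizes the largest losses, and t < 0 de-emphasizes them. Below is a minimal NumPy sketch of this exponentially tilted aggregate; the function name `tilted_loss` and the use of SciPy's `logsumexp` for numerical stability are illustration choices, not code from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def tilted_loss(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * losses))).

    t -> 0 recovers the ordinary average (ERM); t > 0 magnifies the
    largest losses (approaching the max as t -> +inf); t < 0 suppresses
    them (approaching the min as t -> -inf). Uses logsumexp to avoid
    overflow when exponentiating t * losses.
    """
    losses = np.asarray(losses, dtype=float)
    if t == 0.0:  # ERM limit of the tilted objective
        return losses.mean()
    return (logsumexp(t * losses) - np.log(losses.size)) / t

# Example: two well-fit points and one outlier loss.
losses = [0.1, 0.2, 5.0]
print(tilted_loss(losses, 0.0))   # plain average, ~1.77
print(tilted_loss(losses, 2.0))   # tilts toward the worst case, ~4.45
print(tilted_loss(losses, -2.0))  # tilts toward the well-fit points, ~0.35
```

The example illustrates the abstract's two regimes: a positive tilt drives the objective toward the outlier's large loss (the fairness/min-max direction), while a negative tilt largely ignores it (the robustness direction).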
