Tilted Empirical Risk Minimization

Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work we explore them through a unified framework: tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework, showing that TERM can increase the influence of high-loss points to promote fairness, or decrease it to promote robustness to outliers; that it has variance-reduction properties that can benefit generalization; and that it can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be solved efficiently relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.
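Concretely, the tilted objective replaces the plain average of the per-sample losses with their log-mean-exp scaled by the tilt t. The sketch below illustrates this objective and its reweighting interpretation (the gradient of the tilted risk is a weighted average of per-sample gradients, with softmax weights proportional to exp(t * loss)). The helper names (`tilted_loss`, `tilted_weights`) and the toy loss values are illustrative choices of ours, not from the paper.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def tilted_loss(losses, t):
    # Tilted empirical risk: (1/t) * log( (1/N) * sum_i exp(t * l_i) ).
    # t > 0 magnifies the largest losses (fairness across samples);
    # t < 0 suppresses them (robustness to outliers);
    # t -> 0 recovers the standard ERM average.
    losses = np.asarray(losses, dtype=float)
    # logsumexp keeps the computation numerically stable for large |t|.
    return (logsumexp(t * losses) - np.log(losses.size)) / t

def tilted_weights(losses, t):
    # Reweighting view: the gradient of the tilted risk is a weighted
    # average of per-sample gradients, with weights softmax(t * l_i).
    return softmax(t * np.asarray(losses, dtype=float))

losses = np.array([0.1, 0.2, 0.15, 3.0])  # one outlier-like loss

print(tilted_loss(losses, 1e-6))    # ~ mean(losses) = 0.8625
print(tilted_loss(losses, 2.0))     # > mean: the large loss dominates
print(tilted_loss(losses, -2.0))    # < mean: the large loss is damped
print(tilted_weights(losses, -2.0)) # near-zero weight on the 3.0 loss
```

Computing the objective through `logsumexp` (rather than exponentiating directly) avoids overflow when |t| times the largest loss is big, which matters for the more extreme tilt values.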
