Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization

Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications, reflecting the need for classifiers and predictive models that are robust to the distribution shifts arising from phenomena such as selection bias or nonstationarity. Existing algorithms for solving Wasserstein DRSL, one of the most popular DRSL frameworks, based on robustness to perturbations measured in the Wasserstein distance, have serious shortcomings that hinder their use in large-scale problems: in particular, they involve solving complex subproblems and they fail to make use of stochastic gradients. We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable, efficiently implementable stochastic extra-gradient algorithms that provably achieve faster convergence rates than existing approaches. We demonstrate their effectiveness on synthetic and real data in comparison with existing DRSL approaches. Key to our results is the use of variance reduction and random reshuffling to accelerate stochastic min-max optimization, the analysis of which may be of independent interest.
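To make the algorithmic ingredients concrete, the following is a minimal sketch, not the paper's actual method, of a variance-reduced stochastic extra-gradient loop with random reshuffling, applied to a toy strongly-convex-strongly-concave saddle problem min_x max_y (1/n) sum_i [(lam/2)||x||^2 + y^T A_i x - (lam/2)||y||^2], whose unique saddle point is (0, 0). Every name and constant below (op_i, vr_op, the step size eta, the once-per-epoch snapshot schedule) is an illustrative assumption, not a detail taken from the paper.

```python
# Sketch: SVRG-style variance-reduced extra-gradient with random reshuffling
# on a toy saddle problem; problem, step size, and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, eta, epochs = 50, 10, 0.1, 0.05, 50
A = rng.normal(size=(n, d, d)) / np.sqrt(d)  # per-component coupling matrices A_i
A_bar = A.mean(axis=0)

def op_i(i, x, y):
    """Monotone operator (grad_x f_i, -grad_y f_i) of the i-th component."""
    return lam * x + A[i].T @ y, lam * y - A[i] @ x

def op_full(x, y):
    """Full deterministic operator, evaluated only at snapshot points."""
    return lam * x + A_bar.T @ y, lam * y - A_bar @ x

def vr_op(i, u, v, xs, ys, gxf, gyf):
    """Variance-reduced estimate F_i(z) - F_i(z_snapshot) + F(z_snapshot)."""
    gx, gy = op_i(i, u, v)
    sx, sy = op_i(i, xs, ys)
    return gx - sx + gxf, gy - sy + gyf

x, y = rng.normal(size=d), rng.normal(size=d)
for _ in range(epochs):
    xs, ys = x.copy(), y.copy()          # refresh the snapshot once per epoch
    gxf, gyf = op_full(xs, ys)
    for i in rng.permutation(n):         # random reshuffling: sample without replacement
        gx, gy = vr_op(i, x, y, xs, ys, gxf, gyf)    # look-ahead half-step
        xh, yh = x - eta * gx, y - eta * gy
        gx, gy = vr_op(i, xh, yh, xs, ys, gxf, gyf)  # extra-gradient update
        x, y = x - eta * gx, y - eta * gy

print("distance to the saddle point (0, 0):", np.linalg.norm(x) + np.linalg.norm(y))
```

The control variate keeps the per-step stochastic noise proportional to the distance from the snapshot rather than to the raw gradient variance, while the permutation-based inner loop is what the abstract refers to as random reshuffling; both are stated in the abstract as the source of the improved rates.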
