A Stochastic Subgradient Method for Distributionally Robust Non-Convex Learning

We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to uncertainty in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses semi-deviation risk to quantify uncertainty, allowing us to compute solutions that are robust against perturbations in the population data distribution. We consider a broad family of loss functions, which may be non-convex and non-smooth, and develop an efficient stochastic subgradient method for the resulting problem. We prove that the method converges to a point satisfying the optimality conditions of the problem. To our knowledge, this is the first method with rigorous convergence guarantees for non-convex, non-smooth distributionally robust stochastic optimization. Our method can achieve any desired level of robustness with little extra computational cost compared to population risk minimization. We also illustrate the performance of our algorithm on real datasets arising in convex and non-convex supervised learning problems.
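
For context, the semi-deviation construction referenced above is the mean-upper-semideviation risk measure of Ogryczak and Ruszczyński; in a standard form (our notation, not reproduced from the abstract), applied to a random loss Z it reads

\[
\rho[Z] \;=\; \mathbb{E}[Z] \;+\; \varkappa \,\Big( \mathbb{E}\big[ \big( Z - \mathbb{E}[Z] \big)_{+}^{\,p} \big] \Big)^{1/p},
\qquad \varkappa \in [0,1],\quad p \in [1,\infty),
\]

with \(\varkappa = 0\) recovering plain population risk minimization and larger \(\varkappa\) penalizing losses above the mean, which is how the formulation trades robustness against computational and statistical cost.

Below is a minimal sketch of a single-timescale stochastic subgradient scheme for the \(p = 1\) case. The function name drob_subgradient_method, the step sizes alpha and beta, and the running estimates mu (mean loss) and p_hat (probability of exceeding the mean) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def drob_subgradient_method(loss, subgrad, data, x0, kappa=0.5,
                            steps=10_000, alpha=1e-2, beta=1e-1, seed=0):
    """Hypothetical sketch: stochastic subgradient descent on the
    mean-semideviation objective  E[l(x,D)] + kappa * E[(l(x,D) - E[l(x,D)])_+].

    Two running scalars track the population quantities that the
    risk-adjusted subgradient weight needs: mu ~ E[loss] and
    p_hat ~ P(loss > mu).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    mu, p_hat = 0.0, 0.0
    for _ in range(steps):
        d = data[rng.integers(len(data))]      # sample one data point
        l, g = loss(x, d), subgrad(x, d)       # loss value and a subgradient
        excess = 1.0 if l > mu else 0.0        # did this loss exceed the mean?
        mu += beta * (l - mu)                  # update running mean-loss estimate
        p_hat += beta * (excess - p_hat)       # update P(loss > mean) estimate
        w = 1.0 + kappa * (excess - p_hat)     # risk-adjusted sample weight
        x -= alpha * w * np.asarray(g)         # weighted subgradient step
    return x

# Toy usage (synthetic robust least squares, purely illustrative):
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
y = A @ np.ones(5) + 0.1 * rng.normal(size=200)
pairs = [(a, b) for a, b in zip(A, y)]
sq_loss = lambda x, d: 0.5 * (d[0] @ x - d[1]) ** 2
sq_grad = lambda x, d: (d[0] @ x - d[1]) * d[0]
x_hat = drob_subgradient_method(sq_loss, sq_grad, pairs, np.zeros(5))
```

With kappa = 0 the loop reduces to ordinary stochastic subgradient descent, consistent with the abstract's claim that robustness comes at little extra computational cost: the only overhead here is the two scalar recursions for mu and p_hat.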
