Stochastic Subgradient Method Converges on Tame Functions

This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science—including all popular deep learning architectures.
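As an illustration only, the following minimal sketch runs a stochastic subgradient iteration on the one-dimensional function f(x) = |x^2 - 1|, which is semialgebraic and locally Lipschitz but neither smooth nor convex. The test function, the step-size schedule 0.1/(k+1)^0.75, the Gaussian noise model, and the names `subgradient` and `stochastic_subgradient` are all assumptions chosen for this example and are not taken from the paper. The first-order stationary points of this f are x = -1, 0, and 1.

```python
import random


def subgradient(x):
    """Return one Clarke subgradient of f(x) = |x**2 - 1| (illustrative test function)."""
    r = x * x - 1.0
    if r > 0:
        sign = 1.0
    elif r < 0:
        sign = -1.0
    else:
        sign = 0.0  # at the kink any value in [-1, 1] is valid; 0 is one choice
    return sign * 2.0 * x


def stochastic_subgradient(x0, iters=20000, noise=0.5, seed=0):
    """Iterate x_{k+1} = x_k - a_k * (g_k + xi_k) with zero-mean Gaussian noise xi_k.

    The steps a_k = 0.1 / (k + 1)**0.75 are an assumed schedule: they are
    not summable but are square summable, the classical Robbins-Monro
    conditions.
    """
    rng = random.Random(seed)
    x = x0
    for k in range(iters):
        step = 0.1 / (k + 1) ** 0.75
        noisy_g = subgradient(x) + rng.gauss(0.0, noise)
        x -= step * noisy_g
    return x


if __name__ == "__main__":
    x_final = stochastic_subgradient(0.5)
    # Distance from the final iterate to the stationary set {-1, 0, 1}.
    print(x_final, min(abs(x_final - 1.0), abs(x_final), abs(x_final + 1.0)))
```

Started from x0 = 0.5, the final iterate should land close to one of the stationary points, which is consistent with (though of course no substitute for) the convergence result stated above.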
