Efficient Nonconvex Empirical Risk Minimization via Adaptive Sample Size Methods

In this paper, we are interested in finding a local minimizer of an empirical risk minimization (ERM) problem in which the loss associated with each sample is possibly a nonconvex function. Unlike traditional deterministic and stochastic algorithms that attempt to solve the ERM problem for the full training set, we propose an adaptive sample size scheme that reduces the overall computational complexity of finding a local minimum. More precisely, we first find an approximate local minimum of the ERM problem corresponding to a small number of samples and use uniform convergence theory to show that, if the population risk is a Morse function, then by properly increasing the size of the training set the iterates generated by the proposed procedure always stay close to a local minimum of the corresponding ERM problem. Consequently, the proposed procedure eventually finds a local minimum of the ERM problem for the full training set which, with high probability, is also close to a local minimum of the expected risk minimization problem. We formally state the conditions on the size of the initial sample set and characterize the accuracy required of each approximate local minimum to ensure that the iterates always remain in a neighborhood of a local minimum and are not attracted to saddle points.
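As a concrete illustration, the sketch below implements the adaptive sample size idea under simplifying assumptions: the per-sample loss is a logistic loss with a smooth nonconvex regularizer, the inner solver `approx_local_min` is plain gradient descent run until the ERM gradient norm drops below a tolerance proportional to the statistical accuracy 1/sqrt(n) of the current subset, and the subset size is doubled at every stage with the previous solution used as a warm start. The function names, the particular loss, the constants `n0`, `growth`, and `c`, and the use of gradient descent as the subsolver are illustrative assumptions, not the paper's prescribed method, which only requires a subroutine that returns an approximate local minimum (e.g., one capable of escaping saddle points).

```python
import numpy as np

def erm_gradient(w, X, y):
    """Gradient of a smooth nonconvex ERM objective: logistic loss plus the
    nonconvex regularizer alpha * sum_j w_j^2 / (1 + w_j^2), averaged over (X, y)."""
    alpha = 0.1
    z = -y * (X @ w)
    sig = 0.5 * (1.0 + np.tanh(0.5 * z))          # numerically stable sigmoid(z)
    grad_loss = -(X * (y * sig)[:, None]).mean(axis=0)
    grad_reg = alpha * 2.0 * w / (1.0 + w ** 2) ** 2
    return grad_loss + grad_reg

def approx_local_min(w0, X, y, tol, lr=0.5, max_iter=5000):
    """Placeholder inner solver: gradient descent until ||grad|| <= tol."""
    w = w0.copy()
    for _ in range(max_iter):
        g = erm_gradient(w, X, y)
        if np.linalg.norm(g) <= tol:
            break
        w -= lr * g
    return w

def adaptive_sample_size(X, y, n0=128, growth=2, c=1.0, seed=0):
    """Solve ERM on nested subsets of sizes n0, growth*n0, ..., N, warm-starting
    each stage from the previous solution and solving only to the statistical
    accuracy O(1/sqrt(n)) of the current subset."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = 0.01 * rng.standard_normal(d)             # initial point for the smallest ERM problem
    n = min(n0, N)
    while True:
        tol = c / np.sqrt(n)                      # target accuracy ~ statistical accuracy
        w = approx_local_min(w, X[:n], y[:n], tol)
        if n == N:
            return w
        n = min(growth * n, N)                    # geometrically enlarge the training set

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, d = 4096, 20
    X = rng.standard_normal((N, d))
    w_star = rng.standard_normal(d)
    y = np.sign(X @ w_star + 0.1 * rng.standard_normal(N))
    w_hat = adaptive_sample_size(X, y)
    print("gradient norm on the full training set:",
          np.linalg.norm(erm_gradient(w_hat, X, y)))
```

Running the script prints the ERM gradient norm of the final iterate on the full training set; under the paper's assumptions, such an iterate is also, with high probability, close to a local minimum of the population risk.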
