Beyond Convexity: Stochastic Quasi-Convex Optimization

Stochastic convex optimization is a basic and well-studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent that updates according to the direction of the gradients rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally Lipschitz. Quasi-convexity broadens the concept of unimodality to multiple dimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.
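As a rough illustration of the stochastic normalized gradient descent update described above, the sketch below averages a minibatch of stochastic gradients and then steps along the normalized direction only. The interface of grad_estimate, the step size, the minibatch size, and the iteration count are illustrative assumptions, not the paper's exact procedure or constants.

```python
import numpy as np

def sngd(grad_estimate, x0, step_size=0.1, minibatch_size=128, num_iters=1000):
    """Sketch of stochastic normalized gradient descent (assumed interface).

    grad_estimate(x, b) is assumed to return a stochastic gradient of the
    objective at x, averaged over a minibatch of size b.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_estimate(x, minibatch_size)   # minibatch gradient estimate
        norm = np.linalg.norm(g)
        if norm > 0:
            # Use only the direction of the gradient, not its magnitude.
            x = x - step_size * g / norm
    return x

# Illustrative usage: a simple quadratic objective with noisy gradients.
rng = np.random.default_rng(0)

def noisy_grad(x, b):
    # Gradient of 0.5 * ||x||^2 is x; add noise and average over the minibatch.
    samples = x + rng.normal(scale=1.0, size=(b, x.size))
    return samples.mean(axis=0)

x_final = sngd(noisy_grad, x0=np.array([5.0, -3.0]))
```

A larger minibatch reduces the variance of the gradient estimate, which is one intuition for why the analysis requires a minimal minibatch size in the stochastic setting.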
