On the Theoretical Properties of Noise Correlation in Stochastic Optimization

Studying the properties of stochastic noise in the optimization of complex non-convex functions has been an active area of research in machine learning. Prior work [55, 50] has shown that the noise of stochastic gradient descent improves optimization by helping the iterates overcome undesirable obstacles in the landscape. Moreover, injecting artificial Gaussian noise has become a popular way to quickly escape saddle points. Indeed, in the absence of reliable gradient information, noise is used to explore the landscape, but it is unclear which type of noise is optimal in terms of exploration ability. To narrow this gap in our knowledge, we study a general type of continuous-time non-Markovian process, based on fractional Brownian motion, that allows the increments of the process to be correlated. This generalizes processes based on standard Brownian motion, such as the Ornstein-Uhlenbeck process. We show how to discretize such processes, which gives rise to a new algorithm, "fPGD". This method generalizes the known algorithms PGD (perturbed gradient descent) and Anti-PGD [36]. We study the properties of fPGD both theoretically and empirically, showing that its exploration abilities are, in some cases, favorable over those of PGD and Anti-PGD. These results open the way to novel methods that exploit noise for training machine learning models.
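
To make the construction concrete, below is a minimal sketch of an fPGD-style update, assuming only what the abstract states: the perturbations are increments of a fractional Brownian motion with Hurst index H, whose stationary autocovariance at lag k is gamma(k) = (|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})/2, so that H = 1/2 yields the independent Gaussian perturbations of PGD while H < 1/2 yields anti-correlated increments in the spirit of Anti-PGD. The function names (`fgn_increments`, `fpgd`), the Cholesky-based noise sampler, the sqrt(learning-rate) noise scaling, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fgn_increments(n_steps, hurst, rng):
    """Sample one path of fractional Gaussian noise (fGN), i.e. the
    stationary increments of fractional Brownian motion with Hurst
    index `hurst`, via an exact O(n_steps^2) Cholesky factorization
    of the Toeplitz autocovariance matrix."""
    k = np.arange(n_steps)
    # Autocovariance at lag k: gamma(k) = (|k+1|^2H - 2|k|^2H + |k-1|^2H) / 2.
    # For hurst = 0.5 this is 1 at lag 0 and 0 elsewhere (white noise).
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2.0 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]        # Toeplitz covariance
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n_steps))  # jitter for stability
    return chol @ rng.standard_normal(n_steps)

def fpgd(grad, x0, lr=0.05, sigma=0.5, hurst=0.5, n_steps=200, seed=0):
    """Gradient descent with fGN perturbations: hurst = 0.5 recovers
    independent Gaussian perturbations (PGD-like); hurst < 0.5 gives
    anti-correlated perturbations (Anti-PGD-like); hurst > 0.5 gives
    persistent, positively correlated perturbations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    # One independent fGN stream per coordinate.
    noise = np.stack([fgn_increments(n_steps, hurst, rng) for _ in x], axis=1)
    for t in range(n_steps):
        # The sqrt(lr) scaling is one Euler-Maruyama-like convention; the
        # correct scaling for an fBm discretization may differ in the paper.
        x = x - lr * grad(x) + sigma * np.sqrt(lr) * noise[t]
    return x

# Toy illustration: f(x, y) = x^2 - y^2 has a saddle at the origin, where
# the gradient vanishes and noiseless gradient descent never moves.
grad_saddle = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
print(fpgd(grad_saddle, x0=[0.0, 0.0], n_steps=50, hurst=0.3))
```

In the toy run, the injected noise moves the iterate off the saddle, after which the descent direction along -y^2 takes over; varying `hurst` changes how persistently the perturbations push in one direction, which is exactly the exploration property the abstract contrasts across PGD, Anti-PGD, and fPGD.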

[1] Aurélien Lucchi et al. Mean first exit times of Ornstein-Uhlenbeck processes in high-dimensional spaces, 2022, arXiv:2208.04029.

[2] F. Bach et al. Explicit Regularization in Overparametrized Models via Noise Injection, 2022, ArXiv.

[3] A. Orvieto et al. Anticorrelated Noise Injection for Improved Generalization, 2022, ArXiv.

[4] Tengyu Ma et al. Label Noise SGD Provably Prefers Flat Global Minimizers, 2021, NeurIPS.

[5] Colin Wei et al. Shape Matters: Understanding the Implicit Bias of the Noise Covariance, 2020, COLT.

[6] Mert Gürbüzbalaban et al. The Heavy-Tail Phenomenon in SGD, 2020, ICML.

[7] Masashi Sugiyama et al. A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima, 2020, ICLR.

[8] Prakhar Verma. Sparse Gaussian Processes for Stochastic Differential Equations, 2021.

[9] Junmin Liu et al. Understanding Long Range Memory Effects in Deep Neural Networks, 2021, ArXiv.

[10] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, 2018, Cambridge University Press.

[11] Erich Elsen et al. On the Generalization Benefit of Noise in Stochastic Gradient Descent, 2020, ICML.

[12] Guy Blanc et al. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process, 2019, COLT.

[13] Y. Mishura et al. Fractional Ornstein-Uhlenbeck Process with Stochastic Forcing, and its Applications, 2019, Methodology and Computing in Applied Probability.

[14] Praneeth Netrapalli et al. Non-Gaussianity of Stochastic Gradient Noise, 2019, ArXiv.

[15] David J. Schwab et al. How noise affects the Hessian spectrum in overparameterized neural networks, 2019, ArXiv.

[16] Guodong Zhang et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model, 2019, NeurIPS.

[17] Gaël Richard et al. First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise, 2019, NeurIPS.

[18] Michael I. Jordan et al. On Nonconvex Optimization for Machine Learning, 2019, J. ACM.

[19] Levent Sagun et al. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks, 2019, ICML.

[20] Stochastic representation and path properties of a fractional Cox-Ingersoll-Ross process, 2017, Theory of Probability and Mathematical Statistics.

[21] Tuo Zhao et al. Toward Understanding the Importance of Noise in Training Neural Networks, 2019, ICML.

[22] Thomas Hofmann et al. Escaping Saddles with Stochastic Gradients, 2018, ICML.

[23] Chen Jia et al. Moderate maximal inequalities for the Ornstein-Uhlenbeck process, 2017.

[24] Michael I. Jordan et al. Gradient Descent Can Take Exponential Time to Escape Saddle Points, 2017, NIPS.

[25] David M. Blei et al. Stochastic Gradient Descent as Approximate Bayesian Inference, 2017, J. Mach. Learn. Res.

[26] Jorge Nocedal et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[27] Quoc V. Le et al. Adding Gradient Noise Improves Learning for Very Deep Networks, 2015, ArXiv.

[28] A. Novikov et al. Bounds for expected maxima of Gaussian processes and their discrete approximations, 2015, arXiv:1508.00099.

[29] Furong Huang et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, 2015, COLT.

[30] B. P. Rao. Maximal Inequalities for Fractional Brownian Motion: An Overview, 2014.

[31] B. P. Rao. Some Maximal Inequalities for Fractional Brownian Motion with Polynomial Drift, 2013.

[32] L. Sanders et al. First passage times for a tracer particle in single file diffusion and fractional Brownian motion, 2012, The Journal of Chemical Physics.

[33] F. Aurzada. On the one-sided exit problem for fractional Brownian motion, 2011, arXiv:1101.5072.

[34] I. Sokolov et al. Kramers-like escape driven by fractional Gaussian noise, 2010, Physical Review E.

[35] Z. Schuss. Theory and Applications of Stochastic Processes: An Analytical Approach, 2009.

[36] B. Øksendal et al. Stochastic Calculus for Fractional Brownian Motion and Applications, 2008.

[37] A. Novikov et al. On exit times of Lévy-driven Ornstein-Uhlenbeck processes, 2007, arXiv:0709.1746.

[38] Yimin Xiao et al. Dimensional Properties of Fractional Brownian Motion, 2007.

[39] Dongsheng Wu et al. Geometric Properties of Fractional Brownian Sheets, 2007.

[40] Davis-type inequalities for some diffusion processes, 2006.

[41] A. Ayache et al. Asymptotic Properties and Hausdorff Dimensions of Fractional Brownian Sheets, 2005.

[42] J. L. Pedersen et al. Representations of the First Hitting Time Density of an Ornstein-Uhlenbeck Process, 2005.

[43] Patrick Cheridito et al. Fractional Ornstein-Uhlenbeck processes, 2003.

[44] Bernt Øksendal et al. Fractional Brownian Motion in Finance, 2003.

[45] Q. Shao et al. Gaussian processes: Inequalities, small ball probabilities and applications, 2001.

[46] Goran Peskir et al. Maximal inequalities for the Ornstein-Uhlenbeck process, 2000.

[47] Alexander Novikov et al. On some maximal inequalities for fractional Brownian motions, 1999.

[48] G. Molchan. Maximum of a Fractional Brownian Motion: Probabilities of Small Values, 1999.

[49] Zbigniew Michna et al. On tail probabilities and first passage times for fractional Brownian motion, 1999, Math. Methods Oper. Res.

[50] Guozhong An et al. The Effects of Adding Noise During Backpropagation Training on a Generalization Performance, 1996, Neural Computation.

[51] Alan F. Murray et al. Synaptic Weight Noise During MLP Learning Enhances Fault-Tolerance, Generalization and Learning Trajectory, 1992, NIPS.

[52] Manfred Schroeder et al. Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise, 1992.

[53] N. Ikeda et al. Stochastic Differential Equations and Diffusion Processes, 1981.

[54] H. Hurst. Methods of Using Long-Term Storage in Reservoirs, 1956.

[55] A. Kolmogorov. Wienersche Spiralen und einige andere interessante Kurven im Hilbertschen Raum, C. R. (Doklady), 1940.