The sharp, the flat and the shallow: Can weakly interacting agents learn to escape bad minima?

Whether flat minima generalize better, and how such minima can be computed efficiently, remains an open and challenging problem in machine learning. As a first step towards understanding this question, we formalize it as an optimization problem over weakly interacting agents. We review the relevant background from the theory of stochastic processes and distil insights that are useful to practitioners. We then propose an algorithmic framework based on an extended stochastic gradient Langevin dynamics and illustrate its potential. The paper is written as a tutorial and presents an alternative use of multi-agent learning. Our primary focus is on the design of algorithms for machine learning applications; however, the underlying mathematical framework is also suitable for analysing the large-scale systems of agent-based models that are popular in the social sciences, economics and finance.
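To make the idea concrete, the following is a minimal sketch, not the authors' implementation, of stochastic gradient Langevin dynamics with weakly interacting agents: each agent takes a noisy (Langevin) gradient step and is weakly attracted toward the ensemble average, in the spirit of a mean-field interaction. The function and parameter names (the coupling strength `alpha`, the temperature `beta_inv`) are illustrative assumptions, not quantities fixed by the paper.

```python
# Sketch of SGLD with weakly interacting agents (illustrative, not the paper's code).
import numpy as np

def interacting_sgld(loss_grad, x0, n_agents=16, n_steps=1000,
                     step=1e-2, alpha=0.1, beta_inv=1e-3, rng=None):
    """Run n_agents coupled SGLD chains started near x0 (shape (d,))."""
    rng = np.random.default_rng() if rng is None else rng
    d = x0.shape[0]
    # Initialise the agents as small perturbations of x0.
    X = x0 + 0.01 * rng.standard_normal((n_agents, d))
    for _ in range(n_steps):
        mean = X.mean(axis=0)                        # empirical (mean-field) average
        grads = np.stack([loss_grad(x) for x in X])  # (stochastic) gradient per agent
        noise = np.sqrt(2.0 * step * beta_inv) * rng.standard_normal(X.shape)
        # Langevin step plus a weak attraction toward the ensemble mean.
        X = X - step * (grads + alpha * (X - mean)) + noise
    return X

# Toy usage: a symmetric double-well objective x**4 - x**2 in one dimension.
if __name__ == "__main__":
    grad = lambda x: 4 * x**3 - 2 * x
    agents = interacting_sgld(grad, np.array([1.0]), n_steps=5000)
    print(agents.mean(), agents.std())
```

The attraction term plays a role analogous to the coupling in elastic-averaging or consensus-based schemes: a small `alpha` leaves each agent close to plain SGLD, while a larger `alpha` is intended to bias the ensemble toward basins wide enough to accommodate all agents.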
