The Loss Surface of XOR Artificial Neural Networks

Training an artificial neural network involves an optimization process over the landscape defined by the cost (loss) as a function of the network parameters. We explore these landscapes using optimization tools developed for potential energy landscapes in molecular science. The number of local minima and transition states (saddle points of index one), as well as the ratio of transition states to minima, grows rapidly with the number of nodes in the network. There is also a strong dependence on the regularization parameter, with the landscape becoming more convex (fewer minima) as the regularization term increases. We demonstrate that in our formulation, stationary points for networks with N_h hidden nodes, including the minimal network required to fit the XOR data, are also stationary points for networks with N_h+1 hidden nodes when all the weights involving the additional node are zero. Hence, smaller networks trained on XOR data are embedded in the landscapes of larger networks. Our results clarify certain aspects of the classification and sensitivity (to perturbations in the input data) of minima and saddle points for this system, and may provide insight into dropout and network compression.
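
The embedding property can be checked numerically. The sketch below is illustrative only and assumes a 2-N_h-1 architecture with logistic-sigmoid hidden and output units, node biases, and an L2 regularization term applied to the weights but not the biases; the paper's exact formulation may differ. It finds a stationary point of the N_h = 2 network, copies it into the N_h = 3 network with every weight involving the extra hidden node set to zero, and confirms that the gradient norm is unchanged (and essentially zero).

```python
# Minimal sketch (assumed setup, not necessarily the paper's exact formulation):
# a 2-N_h-1 sigmoid network with biases and L2 regularization on the weights only.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR inputs
t = np.array([0., 1., 1., 0.])                          # XOR targets
lam = 1e-4                                              # regularization strength (assumed value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(p, nh):
    """Split the flat parameter vector into (W1, b1, w2, b2) for nh hidden nodes."""
    W1 = p[:2 * nh].reshape(nh, 2)
    b1 = p[2 * nh:3 * nh]
    w2 = p[3 * nh:4 * nh]
    b2 = p[4 * nh]
    return W1, b1, w2, b2

def loss(p, nh):
    W1, b1, w2, b2 = unpack(p, nh)
    h = sigmoid(X @ W1.T + b1)                            # hidden activations, shape (4, nh)
    y = sigmoid(h @ w2 + b2)                              # network outputs, shape (4,)
    reg = 0.5 * lam * (np.sum(W1**2) + np.sum(w2**2))     # regularize weights only, not biases
    return 0.5 * np.sum((y - t)**2) + reg

def grad_norm(p, nh, eps=1e-6):
    """Central-difference gradient norm, used only to check stationarity."""
    g = np.array([(loss(p + eps * e, nh) - loss(p - eps * e, nh)) / (2 * eps)
                  for e in np.eye(p.size)])
    return np.linalg.norm(g)

# Find a stationary point (here, a minimum) of the N_h = 2 network.
rng = np.random.default_rng(1)
p2 = minimize(loss, rng.normal(scale=0.5, size=4 * 2 + 1), args=(2,),
              method="BFGS", options={"gtol": 1e-8}).x
print("grad norm, N_h = 2:", grad_norm(p2, 2))

# Embed it into the N_h = 3 network: the extra hidden node gets zero input
# weights, zero bias, and zero output weight; all other parameters are copied.
W1, b1, w2, b2 = unpack(p2, 2)
p3 = np.concatenate([np.vstack([W1, np.zeros((1, 2))]).ravel(),
                     np.append(b1, 0.0), np.append(w2, 0.0), [b2]])
print("grad norm, N_h = 3:", grad_norm(p3, 3))  # remains ~0: the embedded point is still stationary
```

In this setup the embedding works because the extra node's output weight is zero, so its input weights and bias receive no error signal, while the gradient of the new output weight is proportional to the same residual sum that already vanishes through the output-bias stationarity condition of the smaller network.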
