Learning Dynamics of Linear Denoising Autoencoders

Denoising autoencoders (DAEs) have proven useful for unsupervised representation learning, but a thorough theoretical understanding of how input noise influences learning is still lacking. Here we develop theory for how noise influences learning in DAEs. By focusing on linear DAEs, we are able to derive analytic expressions that exactly describe their learning dynamics. We verify our theoretical predictions with simulations as well as experiments on MNIST and CIFAR-10. The theory illustrates how, when tuned correctly, noise allows DAEs to ignore low variance directions in the inputs while still learning to reconstruct them. Furthermore, comparing the learning dynamics of DAEs to those of standard regularised autoencoders, we show that noise has a regularisation effect similar to that of weight decay, but with faster training dynamics. We also show that our theoretical predictions approximate learning dynamics on real-world data and qualitatively match observed dynamics in nonlinear DAEs.
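
To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of the comparison the abstract describes: a two-layer linear DAE trained with full-batch gradient descent on Gaussian-corrupted inputs, alongside a weight-decay regularised linear autoencoder trained on clean inputs. The data distribution, dimensions, learning rate, and the choice of tying the weight-decay strength to the noise variance are illustrative assumptions, not the paper's exact settings.

# Minimal sketch of a linear DAE vs. a weight-decay regularised linear AE.
# Assumptions: additive isotropic Gaussian corruption, squared-error
# reconstruction of the clean input, full-batch gradient descent, and a
# hidden dimension smaller than the input dimension.
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 2000, 20, 5           # samples, input dim, hidden dim
lr, steps = 0.05, 500           # learning rate, gradient steps
noise_std = 0.5                 # corruption level (sigma)
weight_decay = noise_std ** 2   # illustrative choice, not the paper's exact mapping

# Synthetic data with a few high-variance and many low-variance directions.
scales = np.concatenate([np.full(h, 3.0), np.full(d - h, 0.3)])
X = rng.normal(size=(n, d)) * scales

def train(use_noise):
    """Train a two-layer linear autoencoder; corrupt inputs or apply weight decay."""
    W1 = rng.normal(scale=0.01, size=(h, d))
    W2 = rng.normal(scale=0.01, size=(d, h))
    for _ in range(steps):
        X_in = X + noise_std * rng.normal(size=X.shape) if use_noise else X
        X_hat = X_in @ W1.T @ W2.T           # reconstruction of the clean input
        err = X_hat - X
        G2 = err.T @ (X_in @ W1.T) / n       # gradient of the squared error w.r.t. W2
        G1 = W2.T @ err.T @ X_in / n         # gradient of the squared error w.r.t. W1
        if not use_noise:                    # weight decay in place of input noise
            G1 += weight_decay * W1
            G2 += weight_decay * W2
        W1 -= lr * G1
        W2 -= lr * G2
    return np.mean((X @ W1.T @ W2.T - X) ** 2)

print("DAE (input noise) reconstruction loss:", train(use_noise=True))
print("RAE (weight decay) reconstruction loss:", train(use_noise=False))

Both variants end up shrinking the reconstruction along each input direction in a similar way, which is the sense in which noise acts like weight decay here; logging the loss at every step (rather than only at the end) is the natural way to inspect the difference in training speed that the abstract refers to.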
