On Layer-Wise Representations in Deep Neural Networks

It is well known that deep neural networks form an efficient internal representation of the learning problem. However, it is unclear how this efficient representation is distributed across layers, and how it arises from learning. In this thesis, we develop a kernel-based analysis for deep networks that quantifies the representation at each layer in terms of noise and dimensionality. The analysis is applied to backpropagation networks and deep Boltzmann machines, and captures the layer-wise reduction of noise and dimensionality. It also reveals the disruptive effect of learning noise, which prevents the emergence of highly sophisticated deep models.
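The abstract does not spell out the kernel-based analysis itself. As a rough, hypothetical illustration of what such a layer-wise analysis can look like, the sketch below applies Gaussian-kernel PCA to one layer's representation and measures how well the labels are predicted from a growing number of leading kernel principal components; a representation is good in this sense if few components suffice (low dimensionality) and the residual error is small (low noise). The kernel width, the least-squares readout, and all function names are assumptions for illustration, not the thesis' actual procedure.

```python
# Hypothetical sketch of a layer-wise kernel analysis. For each layer's
# representation X (one row per example), build a Gaussian kernel matrix,
# extract its leading kernel principal components, and record the error of
# a least-squares fit of the labels from the top-d components.
# All parameters and names are illustrative assumptions.
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # RBF kernel matrix from pairwise squared distances.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def center_kernel(K):
    # Standard centering used in kernel PCA.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_analysis_curve(X, Y, sigma=1.0, max_dim=20):
    # Y: one-hot labels of shape (n, classes).
    K = center_kernel(gaussian_kernel(X, sigma))
    eigval, eigvec = np.linalg.eigh(K)
    order = np.argsort(eigval)[::-1]      # leading components first
    eigvec = eigvec[:, order]
    errors = []
    for d in range(1, max_dim + 1):
        U = eigvec[:, :d]                 # top-d kernel PCA basis (orthonormal)
        Y_hat = U @ (U.T @ Y)             # least-squares fit of Y in that basis
        errors.append(np.mean((Y - Y_hat) ** 2))
    return errors                         # error as a function of dimensionality d

# Usage: compare the curves across the layers of a trained network, e.g.
# reps = [X_input, layer1_activations, layer2_activations, ...]
# for l, X in enumerate(reps):
#     print(l, kernel_analysis_curve(X, Y_onehot)[:5])
```

Under these assumptions, a curve that drops faster at deeper layers would indicate the layer-wise reduction of noise and dimensionality that the abstract describes.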
