Deep Learning of Representations: Looking Forward

Deep learning research aims to discover learning algorithms that uncover multiple levels of distributed representations, with higher levels representing more abstract concepts. Although the study of deep learning has already led to impressive theoretical results, learning algorithms, and breakthrough experiments, several challenges lie ahead. This paper examines some of these challenges, centering on the questions of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data. It also proposes a few forward-looking research directions aimed at overcoming these challenges.