Practical Recommendations for Gradient-Based Training of Deep Architectures

Learning algorithms related to artificial neural networks, and in particular to Deep Learning, may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradients and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when one is allowed to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.
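
As a concrete illustration of the kind of training loop and hyper-parameters the chapter covers, here is a minimal, hypothetical NumPy sketch (not code from the chapter): mini-batch stochastic gradient descent on a one-hidden-layer network, with the learning rate, mini-batch size, number of epochs, number of hidden units, and initialization scale exposed as hyper-parameters whose values are placeholders.

    # A minimal sketch (illustrative only) of mini-batch SGD training of a
    # one-hidden-layer network, exposing typical hyper-parameters:
    # learning rate, mini-batch size, number of epochs, number of hidden
    # units, and initialization scale. Values below are placeholders.
    import numpy as np

    rng = np.random.RandomState(0)

    # Hyper-parameters (placeholder values, normally tuned by search).
    learning_rate = 0.1
    batch_size = 32
    n_epochs = 50
    n_hidden = 64
    init_scale = 0.01

    # Synthetic regression data: targets are a nonlinear function of inputs.
    X = rng.randn(1000, 10)
    y = np.sin(X @ rng.randn(10, 1))

    # Weight initialization scale is itself a hyper-parameter.
    W1 = init_scale * rng.randn(10, n_hidden)
    b1 = np.zeros(n_hidden)
    W2 = init_scale * rng.randn(n_hidden, 1)
    b2 = np.zeros(1)

    for epoch in range(n_epochs):
        perm = rng.permutation(len(X))  # reshuffle examples each epoch
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            xb, yb = X[idx], y[idx]

            # Forward pass: tanh hidden layer, linear output.
            h = np.tanh(xb @ W1 + b1)
            pred = h @ W2 + b2
            err = pred - yb  # gradient of 0.5 * squared error w.r.t. pred

            # Back-propagated gradients.
            gW2 = h.T @ err / len(xb)
            gb2 = err.mean(axis=0)
            dh = (err @ W2.T) * (1.0 - h ** 2)
            gW1 = xb.T @ dh / len(xb)
            gb1 = dh.mean(axis=0)

            # Plain stochastic gradient descent update.
            W2 -= learning_rate * gW2
            b2 -= learning_rate * gb2
            W1 -= learning_rate * gW1
            b1 -= learning_rate * gb1

        if epoch % 10 == 0:
            mse = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)
            print(f"epoch {epoch:3d}  training MSE {mse:.4f}")

In practice, hyper-parameter values such as these would be chosen by some form of search (grid, random, or model-based) and validated on held-out data, rather than fixed a priori.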
