A Fast Learning Algorithm for Deep Belief Nets

We show how to use complementary priors to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
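As a rough illustration of the greedy, layer-wise procedure the abstract describes, the sketch below trains a stack of restricted Boltzmann machines (RBMs) with one-step contrastive divergence (CD-1), feeding each layer's hidden activities to the next layer as data. This is a minimal sketch, not the authors' code: the layer sizes, learning rate, number of sweeps, and random stand-in data are illustrative assumptions, and the top-level associative memory and wake-sleep fine-tuning stages are omitted.

```python
# Minimal sketch of greedy layer-wise pretraining with CD-1.
# All hyperparameters and the random "data" are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases
        self.c = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_update(self, v0):
        """One contrastive-divergence (CD-1) step on a batch of visible vectors."""
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)  # sample hiddens
        v1 = self.visible_probs(h0_sample)                     # one reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # Approximate gradient: data statistics minus reconstruction statistics.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)

# Greedy stacking: train one RBM at a time, then propagate the data upward.
data = (rng.random((100, 784)) < 0.5).astype(float)  # stand-in for digit images
layer_sizes = [784, 500, 500]                        # illustrative sizes only
layers = []
for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(10):                              # a few CD-1 sweeps
        rbm.cd1_update(data)
    data = rbm.hidden_probs(data)                    # features for the next layer
    layers.append(rbm)
```

After this pretraining pass, each layer's weights initialize one layer of the deep belief net, which the paper then fine-tunes with a contrastive version of the wake-sleep algorithm.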
