Learning Deep Architectures for AI

Theoretical results strongly suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one needs deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult optimization task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This paper discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
