Paper information - Learning Deep Architectures for AI

Learning Deep Architectures for AI

Theoretical results strongly suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one needs deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult optimization task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This paper discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.

Yoshua Bengio

[1] H. Hotelling. Analysis of a complex of statistical variables into principal components. , 1933 .

[2] D. Hubel,et al. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[3] Ray J. Solomonoff,et al. A Formal Theory of Inductive Inference. Part I , 1964, Inf. Control..

[4] Ray J. Solomonoff,et al. A Formal Theory of Inductive Inference. Part II , 1964, Inf. Control..

[5] A. Kolmogorov. Three approaches to the quantitative definition of information , 1968 .

[6] C. S. Wallace,et al. An Information Measure for Classification , 1968, Comput. J..

[7] F. O'connor. Energy Budget , 1971, Nature.

[8] J. Piaget,et al. The Origins of Intelligence in Children , 1971 .

[9] J. M. Hammersley,et al. Markov fields on finite graphs and lattices , 1971 .

[10] Hans Hermes,et al. Introduction to mathematical logic , 1973, Universitext.

[11] M. A. Griffin,et al. Information Processing Systems , 1976 .

[12] James L. McClelland,et al. An interactive activation model of context effects in letter perception: I. An account of basic findings. , 1981 .

[13] C. D. Gelatt,et al. Optimization by Simulated Annealing , 1983, Science.

[14] Donald Geman,et al. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Geoffrey E. Hinton,et al. A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[16] A. Yao. Separating the polynomial-time hierarchy by oracles , 1985 .

[17] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.

[18] L. Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory , 1986 .

[19] Paul Smolensky,et al. Information processing in dynamical systems: foundations of harmony theory , 1986 .

[20] Rajesh Sharma,et al. Asymptotic analysis , 1986 .

[21] Johan Håstad,et al. Almost optimal lower bounds for small depth circuits , 1986, STOC '86.

[22] Geoffrey E. Hinton,et al. Learning and relearning in Boltzmann machines , 1986 .

[23] James L. McClelland,et al. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations , 1986 .

[24] S. Duane,et al. Hybrid Monte Carlo , 1987 .

[25] Elliott Mendelson,et al. Introduction to mathematical logic (3. ed.) , 1987 .

[26] Ingo Wegener,et al. The complexity of Boolean functions , 1987 .

[27] Yann LeCun,et al. Memoires associatives distribuees: Une comparaison (Distributed associative memories: A comparison) , 1987 .

[28] James L. McClelland,et al. Explorations in parallel distributed processing: a handbook of models, programs, and exercises , 1988 .

[29] J. Stephen Judd,et al. Learning in neural networks , 1988, COLT '88.

[30] James L. McClelland. Explorations In Parallel Distributed Processing , 1988 .

[31] Geoffrey E. Hinton. Learning distributed representations of concepts. , 1989 .

[32] Geoffrey E. Hinton,et al. Parallel Models of Associative Memory , 1989 .

[33] Lawrence D. Jackel,et al. Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[34] W S McCulloch,et al. A logical calculus of the ideas immanent in nervous activity , 1990, The Philosophy of Artificial Intelligence.

[35] Richard A. Harshman,et al. Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[36] E. Allgower,et al. Numerical Continuation Methods , 1990 .

[37] W. Pitts,et al. A Logical Calculus of the Ideas Immanent in Nervous Activity (1943) , 2021, Ideas That Created the Future.

[38] Jordan B. Pollack,et al. Recursive Distributed Representations , 1990, Artif. Intell..

[39] Eugene L. Allgower,et al. Numerical continuation methods - an introduction , 1990, Springer series in computational mathematics.

[40] Risto Miikkulainen,et al. Natural Language Processing With Modular PDP Networks and Distributed Lexicon , 1991, Cogn. Sci..

[41] Sepp Hochreiter,et al. Untersuchungen zu dynamischen neuronalen Netzen , 1991 .

[42] David Haussler,et al. Unsupervised learning of distributions on binary vectors using two layer networks , 1991, NIPS 1991.

[43] Bernhard E. Boser,et al. A training algorithm for optimal margin classifiers , 1992, COLT '92.

[44] David H. Wolpert,et al. Stacked generalization , 1992, Neural Networks.

[45] Yann LeCun,et al. Efficient Pattern Recognition Using a New Transformation Distance , 1992, NIPS.

[46] Radford M. Neal. Connectionist Learning of Belief Networks , 1992, Artif. Intell..

[47] Geoffrey E. Hinton,et al. Autoencoders, Minimum Description Length and Helmholtz Free Energy , 1993, NIPS.

[48] Ming Li,et al. An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[49] J. Elman. Learning and development in neural networks: the importance of starting small , 1993, Cognition.

[50] Maurice Milgram,et al. Transformation Invariant Autoassociation with Application to Handwritten Character Recognition , 1994, NIPS.

[51] David A. Cohn,et al. Active Learning with Statistical Models , 1996, NIPS.

[52] Pekka Orponen,et al. Computational complexity of neural networks: a survey , 1994 .

[53] G. Kane. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 1: Foundations, vol 2: Psychological and Biological Models , 1994 .

[54] Yoshua Bengio,et al. Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[55] Peter Tiňo,et al. Learning long-term dependencies is not as difficult with NARX recurrent neural networks , 1995 .

[56] Carl E. Rasmussen,et al. In Advances in Neural Information Processing Systems , 2011 .

[57] Terrence J. Sejnowski,et al. An Information-Maximization Approach to Blind Separation and Blind Deconvolution , 1995, Neural Computation.

[58] Sebastian Thrun,et al. Is Learning The n-th Thing Any Easier Than Learning The First? , 1995, NIPS.

[59] J. J. Moré,et al. Global continuation for distance geometry problems , 1995 .

[60] Geoffrey E. Hinton,et al. The Helmholtz Machine , 1995, Neural Computation.

[61] Geoffrey E. Hinton,et al. The "wake-sleep" algorithm for unsupervised neural networks. , 1995, Science.

[62] Jonathan Baxter,et al. Learning internal representations , 1995, COLT '95.

[63] J. J. Moré,et al. Smoothing techniques for macromolecular global optimization , 1995 .

[64] Zhi-jun Wu. Global Continuation for Distance Geometry Problems , 1995 .

[65] Geoffrey E. Hinton,et al. Bayesian Learning for Neural Networks , 1995 .

[66] Michael I. Jordan,et al. Mean Field Theory for Sigmoid Belief Networks , 1996, J. Artif. Intell. Res..

[67] Nathan Intrator,et al. How to Make a Low-Dimensional Representation Suitable for Diverse Tasks , 1996 .

[68] Larry A. Rendell,et al. Learning Despite Concept Variation by Finding Structure in Attribute-based Data , 1996, ICML.

[69] Yoav Freund,et al. Experiments with a New Boosting Algorithm , 1996, ICML.

[70] Barak A. Pearlmutter,et al. A Context-Sensitive Generalization of ICA , 1996 .

[71] Thomas F. Coleman,et al. Parallel continuation-based global optimization for molecular conformation and protein folding , 1994, J. Glob. Optim..

[72] David J. Field,et al. Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[73] Larry A. Rendell,et al. Global Data Analysis and the Fragmentation Problem in Decision Tree Induction , 1997, ECML.

[74] William I. Gasarch,et al. Book Review: An introduction to Kolmogorov Complexity and its Applications Second Edition, 1997 by Ming Li and Paul Vitanyi (Springer (Graduate Text Series)) , 1997, SIGACT News.

[75] Geoffrey E. Hinton,et al. Generative models for discovering sparse distributed representations. , 1997, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.

[76] Paul M. B. Vitányi,et al. An Introduction to Kolmogorov Complexity and Its Applications , 1997, Graduate Texts in Computer Science.

[77] H. Sebastian Seung,et al. Learning Continuous Attractors in Recurrent Networks , 1997, NIPS.

[78] Jorge J. Moré,et al. Global Continuation for Distance Geometry Problems , 1995, SIAM J. Optim..

[79] Terrence J. Sejnowski,et al. Learning Nonlinear Overcomplete Representations for Efficient Coding , 1997, NIPS.

[80] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[81] Bernhard Schölkopf,et al. Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[82] Michael I. Jordan. Learning in Graphical Models , 1999, NATO ASI Series.

[83] Jorma Rissanen,et al. Stochastic Complexity in Statistical Inquiry , 1989, World Scientific Series in Computer Science.

[84] Richard S. Sutton,et al. Introduction to Reinforcement Learning , 1998 .

[85] David Haussler,et al. Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[86] Dima Grigoriev,et al. Complexity Lower Bounds for Approximation Algebraic Computation Trees , 1999, J. Complex..

[87] B. Schölkopf,et al. Advances in kernel methods: support vector learning , 1999 .

[88] Gunnar Rätsch,et al. Input space versus feature space in kernel-based methods , 1999, IEEE Trans. Neural Networks.

[89] Terrence J. Sejnowski,et al. Unsupervised Learning , 2018, Encyclopedia of GIS.

[90] Yair Weiss,et al. Segmentation using eigenvectors: a unifying view , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[91] Pietro Perona,et al. Unsupervised Learning of Models for Recognition , 2000, ECCV.

[92] J. Tenenbaum,et al. A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[93] Nathalie Japkowicz,et al. Nonlinear Autoassociation Is Not Equivalent to PCA , 2000, Neural Computation.

[94] Terrence J. Sejnowski,et al. Learning Overcomplete Representations , 2000, Neural Computation.

[95] Yoshua Bengio,et al. A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[96] Vladimir N. Vapnik,et al. The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[97] Geoffrey E. Hinton,et al. Extracting distributed representations of concepts and relations from positive and negative propositions , 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium.

[98] S T Roweis,et al. Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[99] N. Cristianini,et al. On Kernel-Target Alignment , 2001, NIPS.

[100] E. Oja,et al. Independent Component Analysis , 2013 .

[101] Michael I. Jordan,et al. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes , 2001, NIPS.

[102] S. Laughlin,et al. An Energy Budget for Signaling in the Grey Matter of the Brain , 2001, Journal of cerebral blood flow and metabolism : official journal of the International Society of Cerebral Blood Flow and Metabolism.

[103] Yee Whye Teh,et al. A New View of ICA , 2001 .

[104] Lei Wang,et al. Learning kernel parameters by using class separability measure , 2002 .

[105] Geoffrey E. Hinton,et al. Self Supervised Boosting , 2002, NIPS.

[106] Mikhail Belkin,et al. Using manifold structure for partially labelled classification , 2002, NIPS 2002.

[107] Paul E. Utgoff,et al. Many-Layered Learning , 2002, Neural Computation.

[108] Terrence J. Sejnowski,et al. Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[109] Clifton B. Chadwick. What is learning , 2002 .

[110] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[111] Thomas G. Dietterich,et al. Editors. Advances in Neural Information Processing Systems , 2002 .

[112] Matthew Brand,et al. Charting a Manifold , 2002, NIPS.

[113] Jean-Luc Gauvain,et al. Connectionist language modeling for large vocabulary continuous speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[114] Michael Schmitt,et al. Descartes' Rule of Signs for Radial Basis Function Neural Networks , 2002, Neural Computation.

[115] Zoubin Ghahramani,et al. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[116] Ahmad Emami,et al. Training Connectionist Models for the Structured Language Model , 2003, EMNLP.

[117] Bernhard Schölkopf,et al. Learning with Local and Global Consistency , 2003, NIPS.

[118] Tai Sing Lee,et al. Hierarchical Bayesian inference in the visual cortex. , 2003, Journal of the Optical Society of America. A, Optics, image science, and vision.

[119] Patrice Y. Simard,et al. Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[120] Thomas Gärtner,et al. A survey of kernels for structured data , 2003, SKDD.

[121] Yee Whye Teh,et al. Energy-Based Models for Sparse Overcomplete Representations , 2003, J. Mach. Learn. Res..

[122] P. Lennie. The Cost of Cortical Computation , 2003, Current Biology.

[123] Martin Zinkevich,et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[124] Nello Cristianini,et al. Learning the Kernel Matrix with Semidefinite Programming , 2002, J. Mach. Learn. Res..

[125] Leo Breiman,et al. Random Forests , 2001, Machine Learning.

[126] Jonathan Baxter,et al. A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling , 1997, Machine Learning.

[127] Geoffrey E. Hinton,et al. Exponential Family Harmoniums with an Application to Information Retrieval , 2004, NIPS.

[128] G. Peterson. A day of great illumination: B. F. Skinner's discovery of shaping. , 2004, Journal of the experimental analysis of behavior.

[129] Corinna Cortes,et al. Support-Vector Networks , 1995, Machine Learning.

[130] Nando de Freitas,et al. An Introduction to MCMC for Machine Learning , 2004, Machine Learning.

[131] Kunihiko Fukushima,et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position , 1980, Biological Cybernetics.

[132] Mehryar Mohri,et al. Rational Kernels: Theory and Algorithms , 2004, J. Mach. Learn. Res..

[133] Robert Tibshirani,et al. The Entire Regularization Path for the Support Vector Machine , 2004, J. Mach. Learn. Res..

[134] Mikhail Belkin,et al. Regularization and Semi-supervised Learning on Large Graphs , 2004, COLT.

[135] Nicolas Le Roux,et al. Learning Eigenfunctions Links Spectral Embedding and Kernel PCA , 2004, Neural Computation.

[136] H. Schwenk,et al. Efficient training of large neural networks for language modeling , 2004, 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541).

[137] Yoshua Bengio,et al. Non-Local Manifold Tangent Learning , 2004, NIPS.

[138] Samy Bengio,et al. Links between perceptrons, MLPs and SVMs , 2004, ICML.

[139] Y. LeCun,et al. Learning methods for generic object recognition with invariance to pose and lighting , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[140] H. Bourlard,et al. Auto-association by multilayer perceptrons and singular value decomposition , 1988, Biological Cybernetics.

[141] R. Guillery. Is postnatal neocortical maturation hierarchical? , 2005, Trends in Neurosciences.

[142] Marcus Hutter. Simulation Algorithms for Computational Systems Biology , 2017, Texts in Theoretical Computer Science. An EATCS Series.

[143] Johan Håstad,et al. On the power of small-depth threshold circuits , 1991, computational complexity.

[144] L. Bottou,et al. Training Invariant Support Vector Machines using Selective Sampling , 2005 .

[145] Nicolas Le Roux,et al. Convex Neural Networks , 2005, NIPS.

[146] Nicolas Le Roux,et al. Efficient Non-Parametric Function Induction in Semi-Supervised Learning , 2004, AISTATS.

[147] Jean-Luc Gauvain,et al. Building continuous space language models for transcribing european languages , 2005, INTERSPEECH.

[148] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[149] Aapo Hyvärinen,et al. Estimation of Non-Normalized Statistical Models by Score Matching , 2005, J. Mach. Learn. Res..

[150] Emmanuel J. Candès,et al. Decoding by linear programming , 2005, IEEE Transactions on Information Theory.

[151] Nicolas Le Roux,et al. The Curse of Highly Variable Functions for Local Kernel Machines , 2005, NIPS.

[152] Michael S. Lewicki,et al. A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels , 2005, NIPS.

[153] Miguel Á. Carreira-Perpiñán,et al. On Contrastive Divergence Learning , 2005, AISTATS.

[154] Yann LeCun,et al. Loss Functions for Discriminative Training of Energy-Based Models , 2005, AISTATS.

[155] Brian Hazlehurst,et al. How to invent a lexicon: the development of shared symbols in interaction , 2006 .

[156] Yoshua Bengio,et al. Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[157] Geoffrey E. Hinton,et al. Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[158] Yann LeCun,et al. Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[159] Geoffrey E. Hinton,et al. Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[160] Fu Jie Huang,et al. A Tutorial on Energy-Based Learning , 2006 .

[161] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[162] Yoshua Bengio,et al. Nonlocal Estimation of Manifold Structure , 2006, Neural Computation.

[163] Marc'Aurelio Ranzato,et al. Efficient Learning of Sparse Representations with an Energy-Based Model , 2006, NIPS.

[164] Yee Whye Teh,et al. Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation , 2006, Cogn. Sci..

[165] Rajat Raina,et al. Efficient sparse coding algorithms , 2006, NIPS.

[166] Tom Minka,et al. Principled Hybrids of Generative and Discriminative Models , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[167] Max Welling Donald,et al. Products of Experts , 2007 .

[168] Geoffrey E. Hinton,et al. Restricted Boltzmann machines for collaborative filtering , 2007, ICML '07.

[169] Roger B. Grosse,et al. Shift-Invariance Sparse Coding for Audio Classification , 2007, UAI.

[170] Aapo Hyvärinen,et al. Connections Between Score Matching, Contrastive Divergence, and Pseudolikelihood for Continuous-Valued Variables , 2007, IEEE Transactions on Neural Networks.

[171] Honglak Lee,et al. Sparse deep belief net model for visual area V2 , 2007, NIPS.

[172] David G. Lowe,et al. University of British Columbia. , 1945, Canadian Medical Association journal.

[173] Antonio Torralba,et al. Describing Visual Scenes Using Transformed Objects and Parts , 2008, International Journal of Computer Vision.

[174] Geoffrey E. Hinton,et al. Unsupervised Learning of Image Transformations , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[175] Ivan Titov,et al. Constituent Parsing with Incremental Sigmoid Belief Networks , 2007, ACL.

[176] Marc'Aurelio Ranzato,et al. A Unified Energy-Based Framework for Unsupervised Learning , 2007, AISTATS.

[177] Yann LeCun,et al. A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[178] Marc'Aurelio Ranzato,et al. Sparse Feature Learning for Deep Belief Networks , 2007, NIPS.

[179] Geoffrey E. Hinton,et al. To recognize shapes, first learn to generate images. , 2007, Progress in brain research.

[180] Jason Weston,et al. Large-scale kernel machines , 2007 .

[181] Yoshua Bengio,et al. Scaling learning algorithms towards AI , 2007 .

[182] Juan Carlos Niebles,et al. A Hierarchical Model of Shape and Appearance for Human Action Classification , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[183] Geoffrey E. Hinton,et al. Modeling image patches with a directed hierarchy of Markov random fields , 2007, NIPS.

[184] Thomas Serre,et al. A quantitative theory of immediate visual recognition. , 2007, Progress in brain research.

[185] Geoffrey E. Hinton,et al. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure , 2007, AISTATS.

[186] Rajat Raina,et al. Self-taught learning: transfer learning from unlabeled data , 2007, ICML '07.

[187] Yoshua Bengio,et al. An empirical evaluation of deep architectures on problems with many factors of variation , 2007, ICML '07.

[188] Geoffrey E. Hinton,et al. Three new graphical models for statistical language modelling , 2007, ICML '07.

[189] Geoffrey E. Hinton,et al. Learning Multilevel Distributed Representations for High-Dimensional Sequences , 2007, AISTATS.

[190] Marc'Aurelio Ranzato,et al. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[191] Katherine A. Heller,et al. A Nonparametric Bayesian Approach to Modeling Overlapping Clusters , 2007, AISTATS.

[192] Aapo Hyvärinen,et al. Some extensions of score matching , 2007, Comput. Stat. Data Anal..

[193] Geoffrey E. Hinton,et al. Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes , 2007, NIPS.

[194] Alex Bateman,et al. An introduction to hidden Markov models. , 2007, Current protocols in bioinformatics.

[195] Aapo Hyvärinen,et al. A Two-Layer ICA-Like Model Estimated by Score Matching , 2007, ICANN.

[196] Yann LeCun,et al. Deep belief net learning in a long-range vision system for autonomous off-road driving , 2008, 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[197] Ruslan Salakhutdinov,et al. On the quantitative analysis of deep belief networks , 2008, ICML '08.

[198] Yihong Gong,et al. Training Hierarchical Feed-Forward Visual Recognition Models Using Transfer Learning from Pseudo-Tasks , 2008, ECCV.

[199] Marc'Aurelio Ranzato,et al. Semi-supervised learning of compact document representations with deep networks , 2008, ICML '08.

[200] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .

[201] Nicolas Le Roux,et al. Representational Power of Restricted Boltzmann Machines and Deep Belief Networks , 2008, Neural Computation.

[202] Ruslan Salakhutdinov,et al. Evaluating probabilities under high-dimensional latent variable models , 2008, NIPS.

[203] Jason Weston,et al. A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[204] Tijmen Tieleman,et al. Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[205] Michael I. Jordan,et al. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators , 2008, ICML '08.

[206] Jason Weston,et al. Deep learning via semi-supervised embedding , 2008, ICML '08.

[207] Yoshua Bengio,et al. Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[208] Katherine A. Heller,et al. Statistical models for partial membership , 2008, ICML '08.

[209] Antonio Torralba,et al. Small codes and large image databases for recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[210] Guillermo Sapiro,et al. Supervised Dictionary Learning , 2008, NIPS.

[211] Yoshua Bengio,et al. Classification using discriminative restricted Boltzmann machines , 2008, ICML '08.

[212] Geoffrey E. Hinton,et al. A Scalable Hierarchical Distributed Language Model , 2008, NIPS.

[213] Botond Cseke,et al. Advances in Neural Information Processing Systems 20 (NIPS 2007) , 2008 .

[214] David M. Bradley,et al. Differentiable Sparse Coding , 2008, NIPS.

[215] Nicolas Pinto,et al. Establishing Good Benchmarks and Baselines for Face Recognition , 2008 .

[216] Geoffrey E. Hinton,et al. Using fast weights to improve persistent contrastive divergence , 2009, ICML '09.

[217] Yoshua Bengio,et al. Slow, Decorrelated Features for Pretraining Complex Cell-like Networks , 2009, NIPS.

[218] Yoshua Bengio,et al. Exploring Strategies for Training Deep Neural Networks , 2009, J. Mach. Learn. Res..

[219] Honglak Lee,et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[220] Geoffrey E. Hinton,et al. Factored conditional restricted Boltzmann Machines for modeling motion style , 2009, ICML '09.

[221] Geoffrey E. Hinton,et al. Deep Boltzmann Machines , 2009, AISTATS.

[222] Jason Weston,et al. Curriculum learning , 2009, ICML '09.

[223] Yoshua Bengio,et al. Justifying and Generalizing Contrastive Divergence , 2009, Neural Computation.

[224] Pascal Vincent,et al. The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training , 2009, AISTATS.

[225] Geoffrey E. Hinton,et al. Semantic hashing , 2009, Int. J. Approx. Reason..

[226] P. Dayan,et al. Flexible shaping: How learning in small steps helps , 2009, Cognition.

[227] Hossein Mobahi,et al. Deep learning from temporal coherence in video , 2009, ICML '09.