Self-taught learning

We introduce a new machine learning framework called self-taught learning for using unlabeled data in supervised classification tasks. This framework does not require that the unlabeled data follow the class labels of the supervised task, or arise from the same generative distribution. Such unlabeled data is often significantly easier to obtain than the unlabeled data required by previously studied frameworks such as semi-supervised learning. In this thesis, we demonstrate that self-taught learning can be applied successfully to a variety of hard machine learning problems.

The centerpiece of our work is a self-taught learning algorithm based on an optimization problem called "sparse coding." This algorithm uses unlabeled data to learn a new representation for complex, high-dimensional inputs, and then applies supervised learning over this representation. The representation captures higher-level aspects of the input, and significantly improves classification performance on many test domains, including computer vision, audio recognition and text classification. We present efficient sparse coding algorithms for a translation-invariant version of the model that can be applied to audio and image data. We also generalize the model to a much broader class of inputs, including domains that are hard to handle with previous algorithms, and apply the model to text classification and a robotic perception task. Taken together, these experiments demonstrate that, using the self-taught learning framework, machine learning can be applied to much harder problems than previously possible.

These self-taught learning algorithms work best when they are allowed to learn rich models (with millions of parameters) using large amounts of unlabeled data (millions of examples). Unfortunately, with current methods, it can take weeks to learn such rich models. Further, these methods require fast, sequential updates and, with current algorithms, are not conducive to being parallelized on a distributed cluster.
To apply self-taught learning to such large-scale problems, we show that graphics processor hardware (available in most modern desktops) can be used to massively parallelize the algorithms. Using a new, inherently parallel algorithm, sparse coding can be easily implemented on graphics processors, and we show that this can reduce the learning time from about three weeks to a single day. Finally, we consider self-taught learning methods that learn hierarchical representations using unlabeled data. We develop general principles for unsupervised learning of such hierarchical models using graphics processors, and show that the slow learning algorithms for the popular deep belief network model can be successfully parallelized. This implementation is up to 70 times faster than an optimized CPU implementation, reduces the learning time from weeks to hours, and represents the state of the art in learning large deep belief networks.
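
The two-stage idea described in the abstract (learn a representation from unlabeled data with sparse coding, then run supervised learning on that representation) can be illustrated with a small NumPy sketch. This is not the thesis's algorithm: it assumes a plain L1-penalized least-squares objective, computes the codes with ISTA (iterative soft-thresholding) and updates the bases with a ridge-stabilized least-squares step followed by projection onto the unit ball. The function names, the penalty weight `beta`, and the iteration counts are all hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink every entry toward zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_codes(X, B, beta=0.1, n_iter=200):
    """Approximately minimize 0.5*||X - A @ B||_F^2 + beta*||A||_1 over A (ISTA).

    X: (n, d) inputs, B: (k, d) bases (one per row), returns A: (n, k) codes.
    """
    # Step size from the Lipschitz constant of the smooth part's gradient.
    L = max(np.linalg.norm(B @ B.T, 2), 1e-12)
    A = np.zeros((X.shape[0], B.shape[0]))
    for _ in range(n_iter):
        grad = (A @ B - X) @ B.T
        A = soft_threshold(A - grad / L, beta / L)
    return A

def learn_bases(X_unlabeled, k, beta=0.1, n_outer=20, seed=0):
    """Alternate between sparse codes and bases on unlabeled data only."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((k, X_unlabeled.shape[1]))
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    for _ in range(n_outer):
        A = sparse_codes(X_unlabeled, B, beta)
        # Least-squares update of the bases; the small ridge term keeps the
        # linear system solvable when some basis is never used.
        G = A.T @ A + 1e-6 * np.eye(k)
        B = np.linalg.solve(G, A.T @ X_unlabeled)
        # Heuristic projection so that every basis has norm at most 1.
        B /= np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B
```

In this sketch, the bases would be learned once with `learn_bases(X_unlabeled, k)`, after which `sparse_codes(X_labeled, B)` yields the features that are passed to any standard supervised classifier. Each row of the code matrix is updated independently of the others within an ISTA iteration, which hints at why this kind of computation maps well onto parallel hardware.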
