Learning Distributed Representations for Statistical Language Modelling and Collaborative Filtering

With the increasing availability of large datasets, machine learning techniques are becoming an increasingly attractive alternative to expert-designed approaches for solving complex problems in domains where data is abundant. In this thesis we introduce several models for large sparse discrete datasets. Our approach, based on probabilistic models that use distributed representations to alleviate the effects of data sparsity, is applied to statistical language modelling and collaborative filtering. We introduce three probabilistic language models that represent words using learned real-valued vectors. Two of the models are based on the Restricted Boltzmann Machine (RBM) architecture, while the third is a simple deterministic model. We show that the deterministic model outperforms the widely used n-gram models and learns sensible word representations. To reduce the time complexity of training and making predictions with the deterministic model, we introduce a hierarchical version of the model that can be exponentially faster. The speedup is achieved by structuring the vocabulary as a tree over words and exploiting this structure. We propose a simple feature-based algorithm for automatically constructing trees over words from data and show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. We then turn our attention to collaborative filtering and show how RBM models can be used to efficiently model the distribution of sparse high-dimensional user rating vectors, presenting inference and learning algorithms that scale linearly in the number of observed ratings. We also introduce the Probabilistic Matrix Factorization (PMF) model, which is based on a probabilistic formulation of the low-rank matrix approximation problem for partially observed matrices. The two models are then extended to allow conditioning on the identities of the rated items, whether or not the actual rating values are known. Our results on the Netflix Prize dataset show that both RBM and PMF models outperform online SVD models.
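
The exponential speedup of the hierarchical model comes from replacing a flat distribution over the whole vocabulary with a sequence of binary decisions along the path from the root of the word tree to a word's leaf, so scoring a word costs O(log |V|) rather than O(|V|). The sketch below is a minimal illustration of this idea under simplifying assumptions (a balanced binary tree given by the bit pattern of the word index, and a logistic decision at each internal node); it is not the thesis's hierarchical model or its feature-based tree-construction algorithm.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyHierarchicalSoftmax:
    """Word probabilities as a product of binary decisions along a
    root-to-leaf path in a complete binary tree over the vocabulary."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.depth = int(np.ceil(np.log2(vocab_size)))
        # One decision vector per internal node, stored as an implicit heap.
        # (Assumes vocab_size is a power of two so every leaf is a word.)
        self.node_vecs = rng.normal(0.0, 0.1, size=(2 ** self.depth - 1, dim))

    def _path(self, word_id):
        """Internal nodes visited and the left/right bit taken at each."""
        nodes, bits, node = [], [], 0
        for d in reversed(range(self.depth)):
            bit = (word_id >> d) & 1       # 0 = left child, 1 = right child
            nodes.append(node)
            bits.append(bit)
            node = 2 * node + 1 + bit      # heap indexing of the children
        return nodes, bits

    def log_prob(self, word_id, context_vec):
        """log P(word | context); touches only O(log vocab_size) nodes."""
        nodes, bits = self._path(word_id)
        logp = 0.0
        for node, bit in zip(nodes, bits):
            p_right = sigmoid(self.node_vecs[node] @ context_vec)
            logp += np.log(p_right if bit else 1.0 - p_right)
        return logp


# The two decision probabilities at each node sum to one, so the induced
# distribution over leaves is normalized without computing a full softmax.
model = ToyHierarchicalSoftmax(vocab_size=8, dim=4)
context = np.ones(4)
total = sum(np.exp(model.log_prob(w, context)) for w in range(8))
print(total)  # ~1.0
```

In a trained hierarchical model of this general form, only the decision vectors on a word's path receive gradient updates when that word is scored, which is why training, and not just prediction, benefits from the tree structure.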

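The probabilistic low-rank formulation behind PMF can be made concrete with the standard setup from the Probabilistic Matrix Factorization line of work: a Gaussian likelihood over the observed ratings and spherical Gaussian priors on the factor vectors. The notation below (user factors U_i, item factors V_j, indicator I_ij marking observed entries) is the conventional one and is an assumption here, not necessarily the thesis's exact parameterization.

```latex
% Gaussian likelihood over observed entries only (I_{ij} = 1 iff user i rated item j):
p(R \mid U, V, \sigma^2)
  = \prod_{i=1}^{N} \prod_{j=1}^{M}
    \mathcal{N}\!\bigl(R_{ij} \mid U_i^{\top} V_j,\; \sigma^2\bigr)^{I_{ij}}

% Zero-mean spherical Gaussian priors on the user and item factor vectors:
p(U \mid \sigma_U^2) = \prod_{i=1}^{N} \mathcal{N}(U_i \mid 0, \sigma_U^2 I),
\qquad
p(V \mid \sigma_V^2) = \prod_{j=1}^{M} \mathcal{N}(V_j \mid 0, \sigma_V^2 I)

% Maximizing the log-posterior over U and V is then equivalent to minimizing
% a regularized squared error on the observed ratings, with
% \lambda_U = \sigma^2 / \sigma_U^2 and \lambda_V = \sigma^2 / \sigma_V^2:
E = \frac{1}{2} \sum_{i,j} I_{ij} \bigl(R_{ij} - U_i^{\top} V_j\bigr)^2
  + \frac{\lambda_U}{2} \sum_{i} \lVert U_i \rVert^2
  + \frac{\lambda_V}{2} \sum_{j} \lVert V_j \rVert^2
```

Because every sum runs only over entries with I_ij = 1, evaluating this objective and its gradients costs time linear in the number of observed ratings, the same scaling property the abstract highlights for the RBM model.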