Training Restricted Boltzmann Machines on Word Observations

The restricted Boltzmann machine (RBM) is a flexible model for complex data. However, using RBMs for high-dimensional multinomial observations poses significant computational difficulties. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily reach the hundreds of thousands. The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue with a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K. We demonstrate the success of our approach by training RBMs on hundreds of millions of word n-grams using larger vocabularies than previously feasible with RBMs, and by using the learned features to improve performance on chunking and sentiment classification tasks, achieving state-of-the-art results on the latter.
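To make the K-independent update concrete, the sketch below illustrates one Markov chain Monte Carlo operator of the kind the abstract describes: a Metropolis-Hastings step for a single K-way softmax visible unit, with proposals drawn in constant time from a fixed distribution (e.g. a unigram fit) via the alias method. This is a minimal sketch under those assumptions, not the paper's implementation; the names (AliasSampler, mh_update_word, W, b, h) are purely illustrative. The key point is that the acceptance ratio scores only the current and the proposed word, so each update costs O(n_hidden) rather than O(K).

    import numpy as np

    class AliasSampler:
        """Walker's alias method: O(K) setup, O(1) draws from a fixed discrete distribution."""
        def __init__(self, probs):
            p = np.asarray(probs, dtype=np.float64)
            self.K = len(p)
            self.q = p / p.sum()                      # normalized proposal distribution
            scaled = self.q * self.K
            self.prob = np.ones(self.K)
            self.alias = np.arange(self.K)
            small = [i for i in range(self.K) if scaled[i] < 1.0]
            large = [i for i in range(self.K) if scaled[i] >= 1.0]
            while small and large:
                s, l = small.pop(), large.pop()
                self.prob[s], self.alias[s] = scaled[s], l
                scaled[l] -= 1.0 - scaled[s]
                (small if scaled[l] < 1.0 else large).append(l)

        def draw(self, rng):
            i = rng.integers(self.K)                  # pick a bucket uniformly
            return i if rng.random() < self.prob[i] else self.alias[i]

    def mh_update_word(w, h, W, b, sampler, rng):
        """One Metropolis-Hastings update of a softmax visible unit given hidden state h.

        p(w | h) is proportional to exp(b[w] + h . W[:, w]); only the current and
        proposed words are scored, so the cost is independent of the vocabulary size K.
        """
        w_prop = sampler.draw(rng)                    # O(1) proposal from the fixed distribution q
        score_cur = b[w] + h @ W[:, w]
        score_prop = b[w_prop] + h @ W[:, w_prop]
        log_accept = (score_prop - score_cur) + np.log(sampler.q[w]) - np.log(sampler.q[w_prop])
        return w_prop if np.log(rng.random()) < log_accept else w

    # Hypothetical usage: 50 hidden units, vocabulary of 100,000 words, uniform proposal.
    rng = np.random.default_rng(0)
    K, n_hidden = 100_000, 50
    W, b = 0.01 * rng.standard_normal((n_hidden, K)), np.zeros(K)
    sampler = AliasSampler(np.ones(K))
    w_new = mh_update_word(w=42, h=rng.standard_normal(n_hidden), W=W, b=b, sampler=sampler, rng=rng)

In a block Gibbs sweep, a handful of such Metropolis-Hastings steps would stand in for the exact softmax sample at each word position, trading exactness per step for updates whose cost does not grow with the vocabulary.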
