A Fast Variational Approach for Learning Markov Random Field Language Models

Language modelling is a fundamental building block of natural language processing. In practice, however, the size of the vocabulary limits which distributions can be applied to this task: one must either resort to local optimization methods, such as those used in neural language models, or work with heavily constrained distributions. In this work, we take a step towards overcoming these difficulties. We present a method for global-likelihood optimization of a Markov random field language model that exploits long-range contexts in time independent of the corpus size. We take a variational approach to optimizing the likelihood and exploit underlying symmetries to greatly simplify learning. We demonstrate the efficiency of this method both for language modelling and for part-of-speech tagging.
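To make the variational idea concrete, the sketch below shows why global-likelihood training of an MRF is hard and how a variational bound helps: the gradient of the log-likelihood requires the log-partition function log Z, which is intractable at scale, so one replaces it with a tractable surrogate computed from a factorized distribution. This is a minimal illustration using naive mean-field on a toy bigram MRF; the toy model, sizes, and variable names are our own assumptions, and the paper's actual method relies on tree-reweighted bounds and symmetry (lifted) arguments rather than mean-field.

```python
import math
import itertools
import random

V = 3  # toy vocabulary size (assumption for illustration)
T = 4  # toy sentence length

random.seed(0)
# Pairwise potentials theta[a][b] scoring adjacent words (a, b) in a chain MRF:
# p(x) proportional to exp(sum_t theta[x_t][x_{t+1}])
theta = [[random.uniform(-1.0, 1.0) for _ in range(V)] for _ in range(V)]

def score(x):
    """Unnormalized log-probability of a word sequence x."""
    return sum(theta[x[t]][x[t + 1]] for t in range(T - 1))

# Exact log-partition by brute-force enumeration -- feasible only for toy
# sizes; the whole point of variational methods is to avoid this sum.
logZ = math.log(sum(math.exp(score(x))
                    for x in itertools.product(range(V), repeat=T)))

# Mean-field family: q(x) = prod_t q[t][x_t], initialized uniformly.
q = [[1.0 / V] * V for _ in range(T)]

def elbo():
    """Evidence lower bound E_q[score] + H(q) <= log Z."""
    energy = sum(q[t][a] * q[t + 1][b] * theta[a][b]
                 for t in range(T - 1) for a in range(V) for b in range(V))
    entropy = -sum(p * math.log(p) for row in q for p in row if p > 0)
    return energy + entropy

# Coordinate ascent: each update sets q[t] proportional to the exponentiated
# expected score under the neighbouring factors, which cannot decrease the ELBO.
for _ in range(50):
    for t in range(T):
        logits = []
        for v in range(V):
            s = 0.0
            if t > 0:
                s += sum(q[t - 1][a] * theta[a][v] for a in range(V))
            if t < T - 1:
                s += sum(q[t + 1][b] * theta[v][b] for b in range(V))
            logits.append(s)
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        q[t] = [w / z for w in weights]

print("exact log Z:", logZ)
print("mean-field ELBO:", elbo())  # always <= exact log Z
```

In training, the surrogate bound (here the ELBO; in the paper, a tree-reweighted upper bound) stands in for log Z inside the likelihood, so each gradient step costs one tractable inference rather than a sum over all word sequences.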
