Learning Trans-Dimensional Random Fields with Applications to Language Modeling

To describe trans-dimensional observations in sample spaces of different dimensions, we propose a probabilistic model, called the trans-dimensional random field (TRF), by explicitly mixing a collection of random fields. In the framework of stochastic approximation (SA), we develop an effective training algorithm, called augmented SA, which jointly estimates the model parameters and normalizing constants while using trans-dimensional mixture sampling to generate observations of different dimensions. Furthermore, we introduce several statistical and computational techniques that improve the convergence of the training algorithm and reduce computational cost, together enabling us to successfully train TRF models on large datasets. The new model and training algorithm are thoroughly evaluated in a number of experiments. The word morphology experiment provides a benchmark for studying the convergence of the training algorithm and comparing it with other algorithms, because log-likelihoods and gradients can be computed exactly in this setting. For language modeling, our experiments demonstrate the advantages of the TRF approach over n-gram and neural network models: computing data probabilities is more efficient because local normalization is avoided, and a richer set of features can be integrated flexibly.
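
To make the model form concrete, here is a minimal sketch of a toy TRF over binary sequences of lengths 1 to 3. The feature choice, the length prior `PI`, and names such as `feats` and `zeta` are illustrative assumptions rather than the paper's implementation; on real data the per-length log normalizing constants (here `zeta`) cannot be enumerated and are instead estimated jointly with the parameters by augmented SA.

```python
import numpy as np

# Toy trans-dimensional random field (TRF): a mixture of fixed-dimensional
# log-linear random fields, one per sequence length, sharing parameters lam.
# All names and the feature set below are illustrative, not from the paper.

LENGTHS = [1, 2, 3]
PI = {1: 0.5, 2: 0.3, 3: 0.2}   # assumed length probabilities (mixture weights)

def feats(seq):
    """Two toy features: count of ones, count of adjacent equal symbols."""
    seq = np.asarray(seq)
    return np.array([seq.sum(), (seq[:-1] == seq[1:]).sum()], dtype=float)

def enumerate_seqs(l):
    """All binary sequences of length l (feasible only in this toy space)."""
    return [[int(b) for b in np.binary_repr(i, width=l)] for i in range(2 ** l)]

def log_z(l, lam):
    """Exact log normalizing constant log Z_l(lam) by brute-force enumeration."""
    scores = np.array([lam @ feats(s) for s in enumerate_seqs(l)])
    return np.log(np.exp(scores).sum())

def log_prob(seq, lam, zeta):
    """log p(l, x^l) = log pi_l - zeta_l + lam . f(x^l),
    where zeta_l plays the role of log Z_l(lam)."""
    l = len(seq)
    return np.log(PI[l]) - zeta[l] + lam @ feats(seq)

# Usage: with exact zeta, the probabilities over all lengths and sequences sum to 1.
lam = np.array([0.5, -0.2])
zeta = {l: log_z(l, lam) for l in LENGTHS}   # learned via augmented SA in practice
total = sum(np.exp(log_prob(s, lam, zeta)) for l in LENGTHS for s in enumerate_seqs(l))
print(round(total, 6))   # -> 1.0
```

On real corpora the per-length enumeration above is intractable, which is why the normalizing constants are estimated jointly with the model parameters by stochastic approximation, with trans-dimensional mixture sampling providing the (length, sequence) draws.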
