Trans-dimensional Random Fields for Language Modeling

Language modeling (LM) involves determining the joint probability of the words in a sentence. The conditional approach is dominant: it factorizes the joint probability into a product of conditional probabilities, as in n-gram LMs and neural network LMs. An alternative, the random field (RF) approach, is used in whole-sentence maximum entropy (WSME) LMs. Although the RF approach has potential benefits, the empirical results of previous WSME models have not been satisfactory. In this paper, we revisit the RF approach to language modeling with a number of innovations. We propose a trans-dimensional RF (TDRF) model and develop a training algorithm that combines joint stochastic approximation with trans-dimensional mixture sampling. In speech recognition experiments on Wall Street Journal data, our TDRF models perform as well as recurrent neural network LMs while being computationally more efficient at computing sentence probabilities.
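
For orientation, the two approaches can be written out; the notation below is a standard presentation supplied here, not a quotation of the paper's definitions. The conditional approach factorizes the joint probability of a sentence x^l = (x_1, ..., x_l) as

    p(x_1, \dots, x_l) = \prod_{i=1}^{l} p(x_i \mid x_1, \dots, x_{i-1}),

whereas a whole-sentence RF scores the sentence globally through a feature vector f and weights \lambda,

    p(x^l; \lambda) = \frac{1}{Z(\lambda)} \exp\big(\lambda^\top f(x^l)\big),

with Z(\lambda) the normalizing constant. A trans-dimensional RF mixes such fields over sentence lengths l, with a length probability \pi_l and a per-length normalizing constant Z_l(\lambda):

    p(l, x^l; \lambda) = \pi_l \, \frac{\exp\big(\lambda^\top f(x^l)\big)}{Z_l(\lambda)}.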
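
To make the stochastic-approximation idea concrete, below is a minimal, self-contained Python sketch of Younes-style SA maximum-likelihood training for a toy fixed-length random field with bigram indicator features. It illustrates only the general technique: the vocabulary, features, and hyperparameters are invented for the example, and it omits the trans-dimensional mixture sampling over sentence lengths that the actual TDRF algorithm relies on.

    import math
    import random
    from collections import Counter

    # Toy setup (illustrative assumptions, not taken from the paper):
    # a three-word vocabulary, a fixed sentence length, bigram indicator features.
    VOCAB = ["a", "b", "c"]
    LENGTH = 4

    def features(sentence):
        """Bigram indicator features, returned as a Counter keyed by word pairs."""
        return Counter(zip(sentence, sentence[1:]))

    def score(sentence, lam):
        """Unnormalized log-probability: lambda^T f(x)."""
        return sum(lam.get(f, 0.0) * c for f, c in features(sentence).items())

    def gibbs_sweep(sentence, lam):
        """One Gibbs sweep over positions under p(x) proportional to exp(score)."""
        for i in range(len(sentence)):
            logps = []
            for w in VOCAB:
                sentence[i] = w
                logps.append(score(sentence, lam))
            m = max(logps)
            weights = [math.exp(lp - m) for lp in logps]
            sentence[i] = random.choices(VOCAB, weights=weights)[0]
        return sentence

    def train(data, steps=2000, gamma=0.05):
        """Stochastic approximation: lam += gamma * (empirical f - sampled f),
        with a persistent MCMC chain supplying the model-expectation sample."""
        # empirical feature expectation, averaged over training sentences
        emp = Counter()
        for s in data:
            emp.update(features(s))
        for f in emp:
            emp[f] /= len(data)
        lam = {}
        chain = [random.choice(VOCAB) for _ in range(LENGTH)]
        for _ in range(steps):
            chain = gibbs_sweep(chain, lam)
            model = features(chain)
            for f in set(emp) | set(model):
                lam[f] = lam.get(f, 0.0) + gamma * (emp.get(f, 0.0) - model.get(f, 0.0))
        return lam

    if __name__ == "__main__":
        toy_data = [list("abab"), list("abcb")]
        lam = train(toy_data)
        print(sorted(lam.items(), key=lambda kv: -kv[1])[:5])

Each step advances a persistent Gibbs chain one sweep and nudges \lambda toward the empirical feature expectation and away from the sampled one; under suitable step-size conditions this kind of SA recursion converges to the maximum-likelihood estimate.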
