Backoff inspired features for maximum entropy language models

Maximum Entropy (MaxEnt) language models [1, 2] are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff [3] and similar techniques. Even though a backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features substantially improve model quality, as demonstrated by perplexity and word-error rate reductions, even in very large-scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features.

Index Terms: maximum entropy modeling, language modeling, n-gram models, linear models
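As a rough illustration of the kind of model under discussion, the sketch below computes a MaxEnt n-gram probability from indicator features, including a hypothetical backoff-style indicator that fires when the highest-order n-gram was unseen in training. The feature templates, names, and weights are illustrative assumptions for this sketch, not the paper's actual feature set or training procedure.

```python
import math

def ngram_features(history, word, seen_ngrams, max_order=3):
    """Indicator features for a MaxEnt n-gram LM: one feature per context
    suffix paired with the predicted word, plus a hypothetical backoff-style
    indicator that fires when the highest-order n-gram was not observed in
    training (an illustrative assumption, not the paper's feature template)."""
    context = tuple(history[-(max_order - 1):])
    feats = [("ngram", context[k:], word) for k in range(len(context) + 1)]
    if (context, word) not in seen_ngrams:
        feats.append(("backoff", context))
    return feats

def maxent_prob(history, word, vocab, weights, seen_ngrams):
    """P(word | history) = exp(sum_i lambda_i * f_i(history, word)) / Z(history)."""
    def score(w):
        return sum(weights.get(f, 0.0)
                   for f in ngram_features(history, w, seen_ngrams))
    z = sum(math.exp(score(w)) for w in vocab)  # brute-force partition function
    return math.exp(score(word)) / z

# Toy usage with hand-set (not trained) weights over a four-word vocabulary.
vocab = ["the", "cat", "sat", "</s>"]
weights = {("ngram", ("the",), "cat"): 1.2, ("backoff", ("the",)): -0.5}
seen_ngrams = {(("the",), "cat")}
print(maxent_prob(["the"], "cat", vocab, weights, seen_ngrams))
```

In a large-scale setting like the one described in the abstract, the partition function Z(history) would not be computed by brute force over the vocabulary as it is here; this toy sums over four words purely for clarity.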

[1] Jun Wu, et al., "Building a topic-dependent maximum entropy model for very large corpora," 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Stephen J. Wright, et al., "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent," NIPS, 2011.

[3] Ronald Rosenfeld, et al., "A maximum entropy approach to adaptive statistical language modelling," Comput. Speech Lang., 1996.

[4] Sanjeev Khudanpur, et al., "Efficient Subsampling for Training Complex Language Models," EMNLP, 2011.

[5] Ronald Rosenfeld, et al., "Trigger-based language models: a maximum entropy approach," 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Yoshua Bengio, et al., "Hierarchical Probabilistic Neural Network Language Model," AISTATS, 2005.

[7] Sanjay Ghemawat, et al., "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004.

[8] Cyril Allauzen, et al., "Bayesian Language Model Interpolation for Mobile Speech Input," INTERSPEECH, 2011.

[9] Sophia Ananiadou, et al., "Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty," ACL, 2009.

[10] Jun Wu, et al., "Efficient training methods for maximum entropy language modeling," INTERSPEECH, 2000.

[11] Slava M. Katz, et al., "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoust. Speech Signal Process., 1987.

[12] Ruhi Sarikaya, et al., "Joint Morphological-Lexical Language Modeling for Processing Morphologically Rich Languages With Application to Dialectal Arabic," IEEE Transactions on Audio, Speech, and Language Processing, 2008.

[13] Francoise Beaufays, et al., "'Your Word is my Command': Google Search by Voice: A Case Study," 2010.

[14] Alexander J. Smola, et al., "Parallelized Stochastic Gradient Descent," NIPS, 2010.

[15] Mikko Kurimo, et al., "Efficient estimation of maximum entropy language models with n-gram features: an SRILM extension," INTERSPEECH, 2010.

[16] Brian Roark, et al., "Discriminative n-gram language modeling," Comput. Speech Lang., 2007.

[17] Thorsten Brants, et al., "Large Language Models in Machine Translation," EMNLP, 2007.

[18] Gideon S. Mann, et al., "MapReduce/Bigtable for Distributed Optimization," 2010.

[19] Hermann Ney, et al., "Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR," INTERSPEECH, 2013.

[20] Roni Rosenfeld, et al., "A whole sentence maximum entropy language model," 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[21] Ronald Rosenfeld, et al., "A survey of smoothing techniques for ME models," IEEE Trans. Speech Audio Process., 2000.

[22] John N. Tsitsiklis, et al., "Distributed Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms," 1984 American Control Conference.

[23] Ronald Rosenfeld, et al., "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration," Comput. Speech Lang., 2001.

[24] Gideon S. Mann, et al., "Distributed Training Strategies for the Structured Perceptron," NAACL, 2010.