Backoff inspired features for maximum entropy language models

Maximum Entropy (MaxEnt) language models [1, 2] are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff [3] and similar techniques. Even though a backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features substantially improve model quality, as demonstrated by perplexity and word-error rate reductions, even in very large-scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features.

Index Terms: maximum entropy modeling, language modeling, n-gram models, linear models
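As a rough illustration of the kind of model under discussion, the sketch below computes a MaxEnt n-gram probability from indicator features, including a hypothetical backoff-style indicator that fires when the highest-order n-gram was unseen in training. The feature templates, names, and weights are illustrative assumptions for this sketch, not the paper's actual feature set or training procedure.

```python
import math

def ngram_features(history, word, seen_ngrams, max_order=3):
    """Indicator features for a MaxEnt n-gram LM: one feature per context
    suffix paired with the predicted word, plus a hypothetical backoff-style
    indicator that fires when the highest-order n-gram was not observed in
    training (an illustrative assumption, not the paper's feature template)."""
    context = tuple(history[-(max_order - 1):])
    feats = [("ngram", context[k:], word) for k in range(len(context) + 1)]
    if (context, word) not in seen_ngrams:
        feats.append(("backoff", context))
    return feats

def maxent_prob(history, word, vocab, weights, seen_ngrams):
    """P(word | history) = exp(sum_i lambda_i * f_i(history, word)) / Z(history)."""
    def score(w):
        return sum(weights.get(f, 0.0)
                   for f in ngram_features(history, w, seen_ngrams))
    z = sum(math.exp(score(w)) for w in vocab)  # brute-force partition function
    return math.exp(score(word)) / z

# Toy usage with hand-set (not trained) weights over a four-word vocabulary.
vocab = ["the", "cat", "sat", "</s>"]
weights = {("ngram", ("the",), "cat"): 1.2, ("backoff", ("the",)): -0.5}
seen_ngrams = {(("the",), "cat")}
print(maxent_prob(["the"], "cat", vocab, weights, seen_ngrams))
```

In a large-scale setting like the one described in the abstract, the partition function Z(history) would not be computed by brute force over the vocabulary as it is here; this toy sums over four words purely for clarity.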

[1] Jun Wu, et al., "Building a topic-dependent maximum entropy model for very large corpora," 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Stephen J. Wright, et al., "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent," NIPS, 2011.

[3] Ronald Rosenfeld, et al., "A maximum entropy approach to adaptive statistical language modelling," Comput. Speech Lang., 1996.

[4] Sanjeev Khudanpur, et al., "Efficient Subsampling for Training Complex Language Models," EMNLP, 2011.

[5] Ronald Rosenfeld, et al., "Trigger-based language models: a maximum entropy approach," 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Yoshua Bengio, et al., "Hierarchical Probabilistic Neural Network Language Model," AISTATS, 2005.

[7] Sanjay Ghemawat, et al., "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004.

[8] Cyril Allauzen, et al., "Bayesian Language Model Interpolation for Mobile Speech Input," INTERSPEECH, 2011.

[9] Sophia Ananiadou, et al., "Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty," ACL, 2009.

[10] Jun Wu, et al., "Efficient training methods for maximum entropy language modeling," INTERSPEECH, 2000.

[11] Slava M. Katz, et al., "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoust. Speech Signal Process., 1987.

[12] Ruhi Sarikaya, et al., "Joint Morphological-Lexical Language Modeling for Processing Morphologically Rich Languages With Application to Dialectal Arabic," IEEE Transactions on Audio, Speech, and Language Processing, 2008.

[13] Francoise Beaufays, et al., "'Your Word is my Command': Google Search by Voice: A Case Study," 2010.

[14] Alexander J. Smola, et al., "Parallelized Stochastic Gradient Descent," NIPS, 2010.

[15] Mikko Kurimo, et al., "Efficient estimation of maximum entropy language models with n-gram features: an SRILM extension," INTERSPEECH, 2010.

[16] Brian Roark, et al., "Discriminative n-gram language modeling," Comput. Speech Lang., 2007.

[17] Thorsten Brants, et al., "Large Language Models in Machine Translation," EMNLP, 2007.

[18] Gideon S. Mann, et al., "MapReduce/Bigtable for Distributed Optimization," 2010.

[19] Hermann Ney, et al., "Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR," INTERSPEECH, 2013.

[20] Roni Rosenfeld, et al., "A whole sentence maximum entropy language model," 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[21] Ronald Rosenfeld, et al., "A survey of smoothing techniques for ME models," IEEE Trans. Speech Audio Process., 2000.

[22] John N. Tsitsiklis, et al., "Distributed Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms," 1984 American Control Conference.

[23] Ronald Rosenfeld, et al., "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration," Comput. Speech Lang., 2001.

[24] Gideon S. Mann, et al., "Distributed Training Strategies for the Structured Perceptron," NAACL, 2010.