Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals

Maximum Entropy (MaxEnt) language models are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework with a convex loss. MaxEnt models also have the advantage of scaling to large model and training-data sizes. We present the following two contributions to MaxEnt training: (1) by leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of Automatic Speech Recognition (ASR); (2) a novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We evaluate the impact of these approaches on Google’s state-of-the-art ASR for the task of voice-search transcription and dictation. Training 10B-parameter models on corpora of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Human evaluations also show significant improvements over a wide range of domains from the use of non-linguistic features. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
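For context, the conditional MaxEnt model the abstract refers to can be sketched with the standard textbook formulation (not drawn from this abstract; the feature function f and weight vector \theta are generic placeholders), in which the probability of a word w given a context h is

\[ P(w \mid h) = \frac{\exp\big(\theta^\top f(w, h)\big)}{\sum_{w' \in V} \exp\big(\theta^\top f(w', h)\big)} \]

The negative log-likelihood of this model is convex in \theta, which is the convexity property mentioned above, and non-linguistic signals (e.g., geographic context) can enter simply as additional components of the feature vector f(w, h).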
