Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals

Maximum Entropy (MaxEnt) language models are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework with a convex loss. MaxEnt models also have the advantage of scaling to large model and training-data sizes. We present the following two contributions to MaxEnt training: (1) by leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of Automatic Speech Recognition (ASR); (2) a novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We evaluate the impact of these approaches on Google’s state-of-the-art ASR for the task of voice-search transcription and dictation. Training 10B-parameter models on corpora of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Human evaluations also show significant improvements over a wide range of domains from the use of non-linguistic features. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
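For context, the conditional MaxEnt model the abstract refers to can be sketched with the standard textbook formulation (not drawn from this abstract; the feature function f and weight vector \theta are generic placeholders), in which the probability of a word w given a context h is

\[ P(w \mid h) = \frac{\exp\big(\theta^\top f(w, h)\big)}{\sum_{w' \in V} \exp\big(\theta^\top f(w', h)\big)} \]

The negative log-likelihood of this model is convex in \theta, which is the convexity property mentioned above, and non-linguistic signals (e.g., geographic context) can enter simply as additional components of the feature vector f(w, h).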
