Distributed acoustic modeling with back-off n-grams

The paper proposes an approach to acoustic modeling that borrows from n-gram language modeling to scale up both the amount of training data and the model size (measured by the number of parameters) to roughly 100 times those used in current ASR systems. Unseen phonetic contexts are handled with the back-off technique familiar from language modeling, chosen for its implementation simplicity. The new acoustic model is estimated and stored on the MapReduce distributed computing infrastructure. Speech recognition experiments are carried out in an N-best rescoring framework for Google Voice Search. The 87,000 hours of training data are obtained in an unsupervised fashion by filtering utterances in Voice Search logs on ASR confidence. The resulting models are trained with maximum likelihood and contain 20-40 million Gaussians. They achieve relative WER reductions of 11% and 6% over first-pass models trained with maximum likelihood and boosted MMI, respectively.
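The back-off idea borrowed from language modeling can be illustrated with a minimal sketch: when the full phonetic context of a phone was never seen in training, fall back to progressively shorter context suffixes until a model is found. The dictionary layout, model names, and context lengths below are hypothetical simplifications for illustration, not the paper's actual data structures:

```python
def backoff_lookup(models, context, phone):
    """Return the acoustic model for `phone` under the longest available
    phonetic context, backing off to shorter contexts when the full
    context is unseen (hypothetical sketch of n-gram-style back-off)."""
    # Try progressively shorter context suffixes, mirroring n-gram back-off.
    for start in range(len(context) + 1):
        key = (tuple(context[start:]), phone)
        if key in models:
            return models[key]
    raise KeyError(f"no model for phone {phone!r}, even context-independent")

# Usage: models keyed by (context phones, center phone); names are made up.
models = {
    (("k", "ae"), "t"): "GMM_kaet",   # full-context model
    (("ae",), "t"): "GMM_aet",        # shorter-context back-off
    ((), "t"): "GMM_t",               # context-independent fallback
}
print(backoff_lookup(models, ["k", "ae"], "t"))  # full context found
print(backoff_lookup(models, ["s", "ae"], "t"))  # backs off to ("ae",)
print(backoff_lookup(models, ["s", "ih"], "t"))  # backs off to ()
```

A real implementation would also apply discounting when mixing statistics across context orders; the sketch only shows the lookup order.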
