The IBM 2015 English conversational telephone speech recognition system

We describe the latest improvements to the IBM English conversational telephone speech recognition system. The techniques found most beneficial are: maxout networks trained with annealed dropout rates; networks with a very large number of outputs trained on 2000 hours of data; joint modeling of partially unfolded recurrent neural networks and convolutional nets, obtained by combining their bottleneck and output layers and retraining the resulting model; and, lastly, sophisticated language model rescoring with exponential and neural network LMs. Together, these techniques achieve an 8.0% word error rate on the Switchboard portion of the Hub5-2000 evaluation test set, a 23% relative improvement over our previous best published result.
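Of the techniques listed above, maxout units with annealed dropout are the most self-contained to illustrate. A maxout unit takes the maximum over several affine "pieces" of its input, and annealed dropout gradually lowers the dropout rate over the course of training. The following is a minimal NumPy sketch of these two ideas only; the shapes, the linear annealing schedule, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def maxout(x, W, b):
    # x: (batch, d_in); W: (pieces, d_in, d_out); b: (pieces, d_out)
    # Each output unit is the max over its affine pieces.
    z = np.einsum("id,pdo->pio", x, W) + b[:, None, :]  # (pieces, batch, d_out)
    return z.max(axis=0)                                # (batch, d_out)

def annealed_dropout_rate(p0, epoch, total_epochs):
    # Illustrative schedule: linearly anneal the dropout rate
    # from p0 at epoch 0 down to 0 by the final epoch.
    return max(0.0, p0 * (1.0 - epoch / total_epochs))

def dropout(x, p, rng):
    # Standard inverted dropout: zero units with probability p,
    # rescale survivors so the expected activation is unchanged.
    if p <= 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return mask * x / (1.0 - p)
```

At each epoch, the training loop would query `annealed_dropout_rate` and pass the result to `dropout` applied to the maxout activations, so that regularization is strong early in training and fades to plain maximum-likelihood training at the end.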
