Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech-two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

[1]  Kenneth C. Knowlton,et al.  A fast storage allocator , 1965, CACM.

[2]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[3]  Jing Peng,et al.  An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories , 1990, Neural Computation.

[4]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[5]  Hervé Bourlard,et al.  Connectionist probability estimators in HMM speech recognition , 1994, IEEE Trans. Speech Audio Process..

[6]  Steve Renals,et al.  THE USE OF RECURRENT NEURAL NETWORKS IN CONTINUOUS SPEECH RECOGNITION , 1996 .

[7]  Richard Lippmann,et al.  Speech recognition by machines and humans , 1997, Speech Commun..

[8]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[9]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[10]  Daniel P. W. Ellis,et al.  Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[11]  David Miller,et al.  The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text , 2004, LREC.

[12]  Y. LeCun,et al.  Learning methods for generic object recognition with invariance to pose and lighting , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[13]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[14]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[15]  Ashutosh Saxena,et al.  A Fast Data Collection and Augmentation Procedure for Object Recognition , 2008, AAAI.

[16]  Xin Yuan,et al.  Bandwidth optimal all-reduce algorithms for clusters of workstations , 2009, J. Parallel Distributed Comput..

[17]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[18]  Mark J. F. Gales,et al.  Support vector machines for noise robust ASR , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[19]  Rajat Raina,et al.  Large-scale deep unsupervised learning using graphics processors , 2009, ICML '09.

[20]  Jiulong Shan,et al.  Search by voice in Mandarin Chinese , 2010, INTERSPEECH.

[21]  Andrew Y. Ng,et al.  Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning , 2011, 2011 International Conference on Document Analysis and Recognition.

[22]  Dong Yu,et al.  Large vocabulary continuous speech recognition with context-dependent DBN-HMMS , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[24]  Shay B. Cohen,et al.  Advances in Neural Information Processing Systems 25 , 2012, NIPS 2012.

[25]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[26]  Dong Yu,et al.  Conversational Speech Transcription Using Context-Dependent Deep Neural Networks , 2012, ICML.

[27]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Navdeep Jaitly,et al.  Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition , 2012, INTERSPEECH.

[30]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[31]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[32]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[33]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[34]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[35]  Lei Xie,et al.  Context-dependent deep neural networks for commercial Mandarin speech recognition applications , 2013, 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.

[36]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[37]  Philipp Koehn,et al.  Scalable Modified Kneser-Ney Language Model Estimation , 2013, ACL.

[38]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[39]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[40]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[41]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[42]  Navdeep Jaitly,et al.  Vocal Tract Length Perturbation (VTLP) improves speech recognition , 2013 .

[43]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[44]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[45]  Wojciech Zaremba,et al.  Learning to Execute , 2014, ArXiv.

[46]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[47]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[48]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[49]  Yoshua Bengio,et al.  End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[50]  Tara N. Sainath,et al.  Joint training of convolutional and non-convolutional neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Daniel Jurafsky,et al.  First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs , 2014, ArXiv.

[52]  Georg Heigold,et al.  Sequence discriminative distributed training of long short-term memory recurrent neural networks , 2014, INTERSPEECH.

[53]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[54]  Olivier Siohan,et al.  A big data approach to acoustic model training corpus selection , 2014, INTERSPEECH.

[55]  Trishul M. Chilimbi,et al.  Project Adam: Building an Efficient and Scalable Deep Learning Training System , 2014, OSDI.

[56]  Chengzhu Yu,et al.  The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[57]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[58]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[59]  Daniel Jurafsky,et al.  Lexicon-Free Conversational Speech Recognition with Neural Networks , 2015, NAACL.

[60]  Wojciech Zaremba,et al.  An Empirical Exploration of Recurrent Network Architectures , 2015, ICML.

[61]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[62]  Yajie Miao,et al.  EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[63]  Andrew W. Senior,et al.  Fast and accurate recurrent neural network acoustic models for speech recognition , 2015, INTERSPEECH.

[64]  Quoc V. Le,et al.  Listen, Attend and Spell , 2015, ArXiv.

[65]  Tara N. Sainath,et al.  Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[66]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[68]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[69]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[70]  Ying Zhang,et al.  Batch normalized recurrent neural networks , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).