Deep Learning-Based Telephony Speech Recognition in the Wild

In this paper, we explore the effectiveness of a variety of Deep Learning-based acoustic models for conversational telephony speech, specifically TDNN, bLSTM and CNN-bLSTM models. We evaluated these models on both research testsets, such as Switchboard and CallHome, as well as recordings from a realworld call-center application. Our best single system, consisting of a single CNN-bLSTM acoustic model, obtained a WER of 5.7% on the Switchboard testset, and in combination with other models a WER of 5.3% was obtained. On the CallHome testset a WER of 10.1% was achieved with model combination. On the test data collected from real-world call-centers, even with model adaptation using application specific data, the WER was significantly higher at 15.0%. We performed an error analysis on the real-world data and highlight the areas where speech recognition still has challenges.

[1]  Geoffrey Zweig,et al.  The microsoft 2016 conversational speech recognition system , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Florin Curelaru,et al.  Front-End Factor Analysis For Speaker Verification , 2018, 2018 International Conference on Communications (COMM).

[3]  Jürgen Schmidhuber,et al.  Learning Precise Timing with LSTM Recurrent Networks , 2003, J. Mach. Learn. Res..

[4]  George Saon,et al.  The IBM 2016 English Conversational Telephone Speech Recognition System , 2016, INTERSPEECH.

[5]  Jürgen Schmidhuber,et al.  Learning Complex, Extended Sequences Using the Principle of History Compression , 1992, Neural Computation.

[6]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[7]  Meng Cai,et al.  Variance regularization of RNNLM for speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[9]  Geoffrey Zweig,et al.  Achieving Human Parity in Conversational Speech Recognition , 2016, ArXiv.

[10]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[11]  Yiming Wang,et al.  Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[12]  Alexander H. Waibel,et al.  Modular Construction of Time-Delay Neural Networks for Speech Recognition , 1989, Neural Computation.

[13]  Andrew W. Senior,et al.  Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition , 2014, ArXiv.

[14]  George Saon,et al.  Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[15]  Brian Roark,et al.  The OpenGrm open-source finite-state grammar software libraries , 2012, ACL.

[16]  Mark J. F. Gales,et al.  CUED-RNNLM — An open-source toolkit for efficient training and evaluation of recurrent neural network language models , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Jürgen Schmidhuber,et al.  LSTM recurrent networks learn simple context-free and context-sensitive languages , 2001, IEEE Trans. Neural Networks.

[18]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[19]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[20]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[21]  Xiaodong Cui,et al.  English Conversational Telephone Speech Recognition by Humans and Machines , 2017, INTERSPEECH.

[22]  Haihua Xu,et al.  Minimum Bayes Risk decoding and system combination based on a recursion for edit distance , 2011, Comput. Speech Lang..

[23]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..