Densely Connected Networks for Conversational Speech Recognition

In this paper we show how we have achieved the state-of-theart performance on the industry-standard NIST 2000 Hub5 English evaluation set. We propose densely connected LSTMs (namely, dense LSTMs), inspired by the densely connected convolutional neural networks recently introduced for image classification tasks. It is shown that the proposed dense LSTMs would provide more reliable performance as compared to the conventional, residual LSTMs as more LSTM layers are stacked in neural networks. With RNN-LM rescoring and lattice combination on the 5 systems (including 2 dense LSTM based systems) trained across three different phone sets, Capio’s conversational speech recognition system has obtained 5.0% and 9.1% on Switchboard and CallHome, respectively.

[1]  Jungwon Lee,et al.  Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition , 2017, INTERSPEECH.

[2]  Yu Zhang,et al.  Highway long short-term memory RNNS for distant speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Paul Deléglise,et al.  TED-LIUM: an Automatic Speech Recognition dedicated corpus , 2012, LREC.

[4]  Hank Liao,et al.  Speaker adaptation of context dependent deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Xiaodong Cui,et al.  English Conversational Telephone Speech Recognition by Humans and Machines , 2017, INTERSPEECH.

[6]  Yiming Wang,et al.  Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[7]  Rita C Simpson-Vlach,et al.  The MICASE Handbook: A Resource for Users of the Michigan Corpus of Academic Spoken English , 2006 .

[8]  Yonghong Yan,et al.  An Exploration of Dropout with LSTMs , 2017, INTERSPEECH.

[9]  Jürgen Schmidhuber,et al.  Highway Networks , 2015, ArXiv.

[10]  Mark J. F. Gales,et al.  CUED-RNNLM — An open-source toolkit for efficient training and evaluation of recurrent neural network language models , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Jürgen Schmidhuber,et al.  Training Very Deep Networks , 2015, NIPS.

[12]  Petr Motlícek,et al.  Towards utterance-based neural network adaptation in acoustic modeling , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[13]  Kyu J. Han,et al.  Deep Learning-Based Telephony Speech Recognition in the Wild , 2017, INTERSPEECH.

[14]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[15]  Haihua Xu,et al.  Minimum Bayes Risk decoding and system combination based on a recursion for edit distance , 2011, Comput. Speech Lang..

[16]  Khe Chai Sim,et al.  Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems , 2010, INTERSPEECH.

[17]  Liang Lu Sequence training and adaptation of highway deep neural networks , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[18]  H. Nesi,et al.  Research in progress, The British Academic Spoken English (BASE) Corpus Project , 2001 .

[19]  Sanjeev Khudanpur,et al.  Parallel training of DNNs with Natural Gradient and Parameter Averaging , 2014 .

[20]  Yoshua Bengio,et al.  Algorithms for Hyper-Parameter Optimization , 2011, NIPS.

[21]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[22]  Jonathan G. Fiscus,et al.  2000 NIST EVALUATION OF CONVERSATIONAL SPEECH RECOGNITION OVER THE TELEPHONE: ENGLISH AND MANDAR IN PERFORMANCE RESULTS , 2000 .

[23]  Bhuvana Ramabhadran,et al.  Language modeling with highway LSTM , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[24]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Tara N. Sainath,et al.  Highway-LSTM and Recurrent Highway Networks for Speech Recognition , 2017, INTERSPEECH.

[26]  Nikos Komodakis,et al.  Wide Residual Networks , 2016, BMVC.

[27]  Meng Cai,et al.  Variance regularization of RNNLM for speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[29]  Andreas Stolcke,et al.  The Microsoft 2017 Conversational Speech Recognition System , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Yi Yang,et al.  An improved residual LSTM architecture for acoustic modeling , 2017, 2017 2nd International Conference on Computer and Communication Systems (ICCCS).

[32]  Andreas Stolcke,et al.  Comparing Human and Machine Errors in Conversational Speech Transcription , 2017, INTERSPEECH.

[33]  Geoffrey Zweig,et al.  Achieving Human Parity in Conversational Speech Recognition , 2016, ArXiv.