Dynamic Temporal Residual Learning for Speech Recognition

Long short-term memory (LSTM) networks have been widely used in automatic speech recognition (ASR). This paper proposes a novel dynamic temporal residual learning mechanism for LSTM networks that better captures temporal dependencies in sequential data. The mechanism applies shortcut connections with dynamic weights to temporally adjacent LSTM outputs. Two dynamic weight generation methods are proposed: a secondary network and a random weight generator. Experimental results on the Wall Street Journal (WSJ) speech recognition corpus show that the proposed methods outperform the baseline LSTM network.
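The following is a minimal sketch of the mechanism described above, not the paper's reference implementation. It assumes a simple combination rule y_t = h_t + w_t * h_{t-1}, where h_t is the LSTM output at step t and w_t is a dynamic weight produced either by a small secondary network over the adjacent outputs or by a random generator; the module and parameter names (DynamicTemporalResidualLSTM, weight_mode, weight_net) are illustrative, not from the paper.

```python
# Hedged sketch of dynamic temporal residual learning on LSTM outputs.
# Assumption: the shortcut combines each output with its predecessor as
# y_t = h_t + w_t * h_{t-1}; the exact formulation in the paper may differ.
import torch
import torch.nn as nn

class DynamicTemporalResidualLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, weight_mode="network"):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.weight_mode = weight_mode  # "network" or "random"
        # Secondary network: maps a pair of adjacent outputs to a
        # scalar weight in (0, 1).
        self.weight_net = nn.Sequential(
            nn.Linear(2 * hidden_size, 1), nn.Sigmoid()
        )

    def forward(self, x):
        h, _ = self.lstm(x)                # (batch, time, hidden)
        prev = torch.zeros_like(h)
        prev[:, 1:, :] = h[:, :-1, :]      # h_{t-1}, zero-padded at t = 0
        if self.weight_mode == "network":
            # Dynamic weights generated by the secondary network from
            # the temporally adjacent outputs.
            w = self.weight_net(torch.cat([h, prev], dim=-1))
        else:
            # Dynamic weights from a random generator, one scalar per step.
            w = torch.rand(h.size(0), h.size(1), 1, device=h.device)
        # Shortcut connection to the temporally adjacent output.
        return h + w * prev

# Usage with hypothetical 80-dim filterbank features:
layer = DynamicTemporalResidualLSTM(input_size=80, hidden_size=256)
y = layer(torch.randn(4, 100, 80))        # (batch=4, time=100, features=80)
```

Unlike a fixed identity shortcut, the weight here is recomputed at every time step, so the network can modulate how much of the previous output flows into the current one.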
