Light Gated Recurrent Units for Speech Recognition

A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, however, natural and robust human–machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on recurrent neural networks (RNNs), which are naturally able to exploit large time contexts and long-term speech modulations. It is thus of great interest to continue studying techniques for improving the effectiveness of RNNs in processing speech signals. In this paper, we revise one of the most popular RNN models, namely gated recurrent units (GRUs), and propose a simplified architecture that turns out to be very effective for ASR. The contribution of this work is twofold. First, we analyze the role played by the reset gate and show that it is largely redundant with the update gate. We therefore propose to remove the reset gate from the GRU design, leading to a more efficient and compact single-gate model. Second, we propose to replace the hyperbolic tangent with rectified linear unit (ReLU) activations. This variation couples well with batch normalization and could help the model learn long-term dependencies without numerical issues. Results show that the proposed architecture, called the light GRU (Li-GRU), not only reduces the per-epoch training time by more than 30% over a standard GRU, but also consistently improves recognition accuracy across different tasks, input features, and noisy conditions, as well as across different ASR paradigms, ranging from standard DNN-HMM speech recognizers to end-to-end connectionist temporal classification (CTC) models.
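To make the two architectural changes concrete, the sketch below implements a single Li-GRU layer in PyTorch following the description above: the reset gate is removed, the candidate state uses ReLU instead of tanh, and batch normalization is coupled with the recurrence. The class name, the placement of batch normalization on the feed-forward projections only, the zero initial state, and the omission of recurrent biases are illustrative assumptions, not a verbatim reproduction of the authors' implementation.

```python
# Minimal Li-GRU sketch (assumptions noted in the lead-in), written from the
# abstract's description: single update gate z, ReLU candidate state, and
# batch normalization on the input (feed-forward) contributions.
import torch
import torch.nn as nn


class LiGRUSketch(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        # Feed-forward projections for the update gate z and candidate state
        self.wz = nn.Linear(input_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size, bias=False)
        # Recurrent projections: no reset gate, so only two weight matrices
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)
        # Batch normalization on the feed-forward terms (assumed placement)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_size) -> (batch, time, hidden_size)
        batch, steps, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        outputs = []
        for t in range(steps):
            xt = x[:, t, :]
            # Update gate: z_t = sigmoid(BN(W_z x_t) + U_z h_{t-1})
            z = torch.sigmoid(self.bn_z(self.wz(xt)) + self.uz(h))
            # Candidate state: ReLU replaces tanh, stabilized by BN
            h_cand = torch.relu(self.bn_h(self.wh(xt)) + self.uh(h))
            # Convex combination of previous and candidate hidden states
            h = z * h + (1.0 - z) * h_cand
            outputs.append(h)
        return torch.stack(outputs, dim=1)


if __name__ == "__main__":
    # Toy usage: 4 utterances, 50 frames, 40-dimensional acoustic features
    feats = torch.randn(4, 50, 40)
    ligru = LiGRUSketch(input_size=40, hidden_size=128)
    print(ligru(feats).shape)  # torch.Size([4, 50, 128])
```

Compared with a standard GRU, dropping the reset gate removes one input projection and one recurrent projection per layer, which is consistent with the reported reduction in per-epoch training time; the unbounded ReLU candidate is kept numerically stable here by the batch-normalized feed-forward term.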
