Twin Regularization for online speech recognition

Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, that is known to be very helpful to perform robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models with context windows that gather some future frames. This introduces a latency which depends on the number of employed look-ahead features. This paper explores a different approach, based on estimating the future rather than waiting for it. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future. Inspired by a recently proposed technique called Twin Networks, we add a regularization term that forces forward hidden states to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time. The experiments, conducted on a number of datasets, recurrent architectures, input features, and acoustic conditions, have shown the effectiveness of this approach. One important advantage is that our method does not introduce any additional computation at test time if compared to standard unidirectional recurrent networks.

[1]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[2]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[3]  Yi He,et al.  Deep LSTM for Large Vocabulary Continuous Speech Recognition , 2017, ArXiv.

[4]  Yoshua Bengio,et al.  Light Gated Recurrent Units for Speech Recognition , 2018, IEEE Transactions on Emerging Topics in Computational Intelligence.

[5]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[6]  Yoshua Bengio,et al.  Improving Speech Recognition by Revising Gated Recurrent Units , 2017, INTERSPEECH.

[7]  Georg Heigold,et al.  Asynchronous, online, GMM-free training of a context dependent acoustic model for speech recognition , 2014, INTERSPEECH.

[8]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Maurizio Omologo,et al.  Realistic Multi-Microphone Data Simulation for Distant Speech Recognition , 2016, INTERSPEECH.

[10]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[11]  Geoffrey E. Hinton,et al.  A Simple Way to Initialize Recurrent Networks of Rectified Linear Units , 2015, ArXiv.

[12]  Jianxin Wu,et al.  Minimal gated unit for recurrent neural networks , 2016, International Journal of Automation and Computing.

[13]  Alexander Gruenstein,et al.  Accurate and compact large vocabulary speech recognition on mobile devices , 2013, INTERSPEECH.

[14]  Yoshua Bengio,et al.  Variational Bi-LSTMs , 2017, ArXiv.

[15]  Maurizio Omologo,et al.  The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[16]  Sanjeev Khudanpur,et al.  Reverberation robust acoustic modeling using i-vectors with time delay neural networks , 2015, INTERSPEECH.

[17]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[18]  Dong Yu,et al.  Automatic Speech Recognition: A Deep Learning Approach , 2014 .

[19]  Ying Zhang,et al.  Batch normalized recurrent neural networks , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[21]  Yiming Wang,et al.  Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs , 2018, IEEE Signal Processing Letters.

[22]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[23]  Georg Heigold,et al.  Small-footprint keyword spotting using deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Inchul Song,et al.  RNNDROP: A novel dropout for RNNS in ASR , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[25]  Kai Chen,et al.  Training Deep Bidirectional LSTM Acoustic Model for LVCSR by a Context-Sensitive-Chunk BPTT Approach , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[26]  Yoshua Bengio,et al.  MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation , 2018, 2018 International Joint Conference on Neural Networks (IJCNN).

[27]  Hermann Ney,et al.  Towards Online-Recognition with Deep Bidirectional LSTM Acoustic Models , 2016, INTERSPEECH.

[28]  Yoshua Bengio,et al.  Z-Forcing: Training Stochastic Recurrent Networks , 2017, NIPS.

[29]  Christopher Joseph Pal,et al.  Twin Networks: Matching the Future for Sequence Generation , 2017, ICLR.

[30]  Alex Acero,et al.  Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[31]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[32]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[33]  Zhijie Yan,et al.  Improving latency-controlled BLSTM acoustic models for online speech recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Maurizio Omologo,et al.  Contaminated speech training methods for robust DNN-HMM distant speech recognition , 2017, INTERSPEECH.

[35]  Geoffrey Zweig,et al.  Deep bi-directional recurrent networks over spectral windows , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[36]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[37]  Tara N. Sainath,et al.  Convolutional neural networks for small-footprint keyword spotting , 2015, INTERSPEECH.

[38]  Yongqiang Wang,et al.  Small-footprint high-performance deep neural network-based speech recognition using split-VQ , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Maurizio Omologo,et al.  Automatic context window composition for distant speech recognition , 2018, Speech Commun..

[40]  Samy Bengio,et al.  An Online Sequence-to-Sequence Model Using Partial Conditioning , 2015, NIPS.

[41]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[42]  Florian Metze,et al.  New Era for Robust Speech Recognition , 2017, Springer International Publishing.

[43]  Zoubin Ghahramani,et al.  A Theoretically Grounded Application of Dropout in Recurrent Neural Networks , 2015, NIPS.

[44]  Masakiyo Fujimoto,et al.  Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[45]  Steve Renals,et al.  Small-Footprint Highway Deep Neural Networks for Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.