State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention with Dilated 1D Convolutions

Self-attention has been a huge success for many downstream tasks in NLP, which has led to the exploration of self-attention for speech problems as well. The efficacy of self-attention in speech applications, however, has not yet been fully realized, since it is challenging to handle highly correlated speech frames in the context of self-attention. In this paper we propose a new neural network model architecture, namely multi-stream self-attention, to address this issue and thus make the self-attention mechanism more effective for speech recognition. The proposed model architecture consists of parallel streams of self-attention encoders, and each stream has layers of 1D convolutions with dilated kernels whose dilation rates are unique to that stream, followed by a self-attention layer. The self-attention mechanism in each stream therefore attends to only one resolution of the input speech frames, which makes the attentive computation more efficient. In a later stage, the outputs from all the streams are concatenated and then linearly projected to the final embedding. By stacking the proposed multi-stream self-attention encoder blocks and rescoring the resultant lattices with neural network language models, we achieve a word error rate of 2.2% on the test-clean set of the LibriSpeech corpus, the best number reported on this dataset thus far.
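The abstract describes the block structure but not its exact configuration. The following is a minimal PyTorch sketch of the idea as stated: parallel streams, each applying dilated 1D convolutions with a stream-specific dilation rate before a self-attention layer, with the stream outputs concatenated and linearly projected. All hyperparameters here (feature dimension, kernel size, number of convolution layers, dilation rates, attention heads) are illustrative assumptions, not values from the paper.

```python
# Sketch of a multi-stream self-attention encoder block.
# Hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn


class SelfAttentionStream(nn.Module):
    """One stream: stacked dilated 1D convolutions, then self-attention."""

    def __init__(self, dim, dilation, num_conv_layers=3, kernel_size=3, num_heads=8):
        super().__init__()
        convs = []
        for _ in range(num_conv_layers):
            # "Same" padding so the number of frames is preserved per layer.
            padding = dilation * (kernel_size - 1) // 2
            convs += [
                nn.Conv1d(dim, dim, kernel_size, padding=padding, dilation=dilation),
                nn.ReLU(),
            ]
        self.convs = nn.Sequential(*convs)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, frames, dim)
        # Conv1d expects (batch, channels, frames), so transpose around it.
        y = self.convs(x.transpose(1, 2)).transpose(1, 2)
        # Each stream attends over frames at a single temporal resolution.
        out, _ = self.attn(y, y, y)
        return out


class MultiStreamSelfAttention(nn.Module):
    """Parallel streams with distinct dilation rates, concatenated and projected."""

    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.streams = nn.ModuleList(SelfAttentionStream(dim, d) for d in dilations)
        self.proj = nn.Linear(dim * len(dilations), dim)

    def forward(self, x):
        return self.proj(torch.cat([s(x) for s in self.streams], dim=-1))


if __name__ == "__main__":
    block = MultiStreamSelfAttention(dim=256)
    frames = torch.randn(4, 100, 256)  # (batch, frames, features)
    print(block(frames).shape)  # torch.Size([4, 100, 256])
```

Because each block maps a (batch, frames, dim) tensor back to the same shape, blocks of this form can be stacked, as the abstract describes, to build the full encoder.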
