Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion

In this paper, we propose an end-to-end speech recognition network based on NVIDIA's QuartzNet model. To improve its performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream applies a distinct dilation rate; (2) a Channel-Wise Attention Module, which computes an attention weight for each convolutional stream via spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block using global multi-layer feature maps. Our experiments show that the Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 dataset, outperforming the original QuartzNet and approaching the state-of-the-art result.
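To make the first two components concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the core idea: several depthwise 1D convolution streams over the same input, each with a different dilation rate, fused by a softmax attention weight derived from pooled channel statistics. All function names, the scalar pooling descriptor, and the random weight initialization are illustrative assumptions.

```python
import numpy as np

def dilated_depthwise_conv1d(x, w, dilation):
    """Depthwise 1D convolution with 'same' zero padding.
    x: (channels, time), w: (channels, kernel), one filter per channel."""
    C, T = x.shape
    K = w.shape[1]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((C, T))
    for k in range(K):
        # each kernel tap sees the input shifted by k * dilation samples
        out += w[:, k:k + 1] * xp[:, k * dilation: k * dilation + T]
    return out

def channel_attention(streams):
    """Fuse streams by softmax over a pooled per-stream descriptor
    (here: global average over channels and time, a simplifying assumption)."""
    scores = np.array([s.mean() for s in streams])
    e = np.exp(scores - scores.max())          # numerically stable softmax
    weights = e / e.sum()
    return sum(w * s for w, s in zip(weights, streams))

def multi_resolution_block(x, kernels, dilations, rng):
    """One multi-resolution block: parallel dilated streams + attention fusion."""
    C = x.shape[0]
    streams = []
    for K, d in zip(kernels, dilations):
        w = rng.standard_normal((C, K)) * 0.1  # toy random weights
        streams.append(dilated_depthwise_conv1d(x, w, d))
    return channel_attention(streams)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 20))               # 4 channels, 20 time steps
y = multi_resolution_block(x, kernels=[3, 3, 3], dilations=[1, 2, 4], rng=rng)
```

Because each stream keeps the same kernel size but a different dilation, the block covers multiple temporal receptive fields at roughly the cost of one convolution per stream, and the attention weights let the network emphasize the resolution that fits the input.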
