Improving RNN transducer with normalized jointer network

The recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) model for automatic speech recognition (ASR) and has shown performance superior to traditional hybrid ASR systems. However, training an RNN-T from scratch remains challenging. We observe a large gradient variance during RNN-T training and suspect that it hurts performance. In this work, we analyze the cause of this large gradient variance and propose a new \textit{normalized jointer network} to overcome it. We also enhance the RNN-T with a modified conformer encoder network and a transformer-XL predictor network to achieve the best performance. Experiments are conducted on the open 170-hour AISHELL-1 corpus and an industrial 30000-hour Mandarin speech dataset. On AISHELL-1, our RNN-T system achieves state-of-the-art results on the streaming and non-streaming benchmarks, with CERs of 6.15\% and 5.37\%, respectively. On the 30000-hour industrial audio data, our RNN-T system obtains a 9\% relative improvement over our well-trained commercial hybrid system, without pre-training or an external language model.
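
To illustrate the gradient-variance issue and one way a normalized jointer could address it, here is a minimal PyTorch sketch. It assumes the jointer broadcast-adds the encoder output (length T) and the predictor output (length U) over the T-by-U lattice, so the un-normalized backward pass sums U gradient terms into each encoder frame and T terms into each predictor step; the sketch rescales those sums by 1/U and 1/T. All names are hypothetical, and the normalization scheme shown is an assumption for illustration, not the paper's exact formulation.

```python
import torch


class NormalizedJoint(torch.autograd.Function):
    """Broadcast-add enc (B, T, H) and pred (B, U, H) into the RNN-T joint
    lattice (B, T, U, H), rescaling the accumulated gradients on the way back.
    Hypothetical sketch; the normalization factors are assumptions."""

    @staticmethod
    def forward(ctx, enc, pred):
        ctx.T, ctx.U = enc.size(1), pred.size(1)
        # Each encoder frame is reused U times and each predictor step T times.
        return enc.unsqueeze(2) + pred.unsqueeze(1)

    @staticmethod
    def backward(ctx, grad_joint):
        # Without normalization, the gradient w.r.t. each encoder frame is a
        # sum of U terms (and T terms per predictor step), so its scale grows
        # with sequence length; dividing keeps it roughly length-independent.
        grad_enc = grad_joint.sum(dim=2) / ctx.U    # (B, T, H)
        grad_pred = grad_joint.sum(dim=1) / ctx.T   # (B, U, H)
        return grad_enc, grad_pred


if __name__ == "__main__":
    B, T, U, H = 2, 50, 10, 8
    enc = torch.randn(B, T, H, requires_grad=True)
    pred = torch.randn(B, U, H, requires_grad=True)
    joint = NormalizedJoint.apply(enc, pred)        # (B, T, U, H)
    joint.sum().backward()
    print(enc.grad.shape, pred.grad.shape)
```

In a full RNN-T, the joint output would then pass through a non-linearity and a linear projection to vocabulary logits before the transducer loss; this sketch isolates only the gradient-rescaling step at the jointer.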
