Exploring Model Units and Training Strategies for End-to-End Speech Recognition

In this work, we explore end-to-end speech recognition models (CTC, RNN-Transducer, and attention-based models) with different model units (character, wordpiece, and word) and various training strategies. We show that wordpiece units outperform character units for all end-to-end systems on the Switchboard Hub5'00 benchmark. To improve the performance of end-to-end systems, we propose a multi-stage pretraining strategy, which gives 25.0% and 18.0% relative improvements over training from scratch for the attention and RNN-T models, respectively, with wordpiece units. Combined with other training strategies such as label smoothing and data augmentation, we achieve state-of-the-art performance on the Switchboard+Fisher-2000h task: 5.9%/12.1% WER on the Switchboard/CallHome test sets without any external language model. This is a new performance milestone for a single end-to-end system, and it also surpasses the previous best published hybrid system, which reports 6.7%/12.5% on the two sets.
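Label smoothing, one of the training strategies mentioned above, softens the one-hot training target so the model is not pushed toward overconfident output distributions. The sketch below is a minimal, framework-free illustration of uniform label smoothing, not the authors' implementation; the function name and interface are hypothetical.

```python
import math

def label_smoothing_nll(log_probs, target, epsilon=0.1):
    """Cross-entropy with uniform label smoothing (illustrative sketch).

    log_probs: per-class log-probabilities for one output step.
    target:    index of the reference label.
    epsilon:   probability mass spread uniformly over all classes.
    """
    n = len(log_probs)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Smoothed target: (1 - epsilon) on the reference label,
        # plus epsilon / n shared across the whole vocabulary.
        q = (1.0 - epsilon) * (1.0 if i == target else 0.0) + epsilon / n
        loss -= q * lp
    return loss
```

With epsilon = 0 this reduces to the standard negative log-likelihood; a positive epsilon penalizes distributions that concentrate all mass on a single label, which in practice acts as a regularizer for attention and RNN-T decoders.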
