Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition

In this paper, we propose an utterance invariant training (UIT) scheme specifically designed to improve the performance of a two-pass end-to-end hybrid ASR system. Our hybrid ASR solution uses a shared encoder with a monotonic chunkwise attention (MoChA) decoder for streaming capability, and a low-latency bidirectional full-attention (BFA) decoder for improved overall ASR accuracy. A modified sequence summary network (SSN) based utterance invariant training is adapted to suit the two-pass model architecture. The input feature stream, self-conditioned by scaling and shifting with its own sequence summary, is used as concatenative conditioning on the bidirectional encoder layers sitting on top of the shared encoder. In effect, the proposed utterance invariant training combines three types of conditioning: concatenative, multiplicative, and additive. Experimental results show that the proposed approach reduces word error rates by up to 7% relative on Librispeech, and by 10-15% relative on a large-scale Korean end-to-end two-pass hybrid ASR model.
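
To make the combined conditioning concrete, below is a minimal PyTorch sketch. The module names (SequenceSummaryNetwork, UITConditioning), the mean-pooled summary, the layer sizes, and the assumption that the conditioned features are time-aligned with the shared-encoder output are all illustrative choices, not details taken from the paper; in a real two-pass model the shared encoder typically subsamples in time, so the conditioned feature stream would need matching subsampling before concatenation.

```python
# Minimal sketch of SSN-based utterance invariant conditioning combining
# multiplicative (scale), additive (shift), and concatenative conditioning.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SequenceSummaryNetwork(nn.Module):
    """Maps a feature sequence (B, T, D) to one summary vector per utterance."""

    def __init__(self, feat_dim: int, summary_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, summary_dim), nn.Tanh())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Mean pooling over time yields an utterance-level summary
        # of shape (B, summary_dim); other pooling choices are possible.
        return self.proj(feats).mean(dim=1)


class UITConditioning(nn.Module):
    """Scale/shift the input features with their own sequence summary, then
    concatenate the result onto the shared-encoder output feeding the
    bidirectional (BFA-side) encoder layers."""

    def __init__(self, feat_dim: int, enc_dim: int, summary_dim: int):
        super().__init__()
        self.ssn = SequenceSummaryNetwork(feat_dim, summary_dim)
        self.scale = nn.Linear(summary_dim, feat_dim)  # multiplicative term
        self.shift = nn.Linear(summary_dim, feat_dim)  # additive term
        self.bfa_layers = nn.LSTM(
            enc_dim + feat_dim, enc_dim, num_layers=2,
            batch_first=True, bidirectional=True,
        )

    def forward(self, feats: torch.Tensor, shared_enc_out: torch.Tensor):
        s = self.ssn(feats)  # (B, summary_dim)
        # Self-conditioning: scale and shift the features with their summary.
        cond = feats * self.scale(s).unsqueeze(1) + self.shift(s).unsqueeze(1)
        # Concatenative conditioning on the bidirectional encoder input;
        # assumes cond and shared_enc_out share the same frame rate.
        x = torch.cat([shared_enc_out, cond], dim=-1)
        out, _ = self.bfa_layers(x)
        return out


if __name__ == "__main__":
    feats = torch.randn(2, 100, 80)     # (batch, frames, mel bins)
    enc_out = torch.randn(2, 100, 512)  # shared-encoder output, same T
    uit = UITConditioning(feat_dim=80, enc_dim=512, summary_dim=128)
    print(uit(feats, enc_out).shape)    # torch.Size([2, 100, 1024])
```

A single summary vector per utterance is what makes the conditioning "utterance invariant": the same scale and shift are applied at every frame, so the network learns to normalize away utterance-level nuisance factors rather than frame-level content.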
