Improving Proper Noun Recognition in End-to-End ASR by Customization of the MWER Loss Criterion

Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems: a particular name may appear only rarely during training and may have a pronunciation similar to that of a more common word. Unlike conventional ASR systems, E2E models lack both an explicit pronunciation model that can be trained on proper noun pronunciations and an explicit language model that can be trained on a large text-only corpus. Past work has addressed this issue by incorporating additional training data or additional models. In this paper, we instead build on recent advances in minimum word error rate (MWER) training to develop two new loss criteria that specifically emphasize proper noun recognition. Unlike past approaches, this method requires no new data during training and no external models during inference. We observe improvements ranging from 2% to 7% relative on several relevant benchmarks.
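For concreteness, the objective being customized can be sketched as follows. This is a minimal illustration assuming the standard n-best approximation used in prior MWER work; the proper-noun weighting shown afterwards is a hypothetical example of how such a criterion could emphasize names, not necessarily either of the two criteria proposed in the paper. Given an n-best list of hypotheses $y_1, \dots, y_N$ for input $x$ and reference transcript $y^{*}$, MWER training minimizes the expected number of word errors:

\[
\mathcal{L}_{\text{MWER}} = \sum_{i=1}^{N} \hat{P}(y_i \mid x)\,\big(W(y_i, y^{*}) - \overline{W}\big),
\]

where $W(y_i, y^{*})$ is the number of word errors in hypothesis $y_i$, $\hat{P}(y_i \mid x)$ is the model probability renormalized over the n-best list, and $\overline{W}$ is the mean error count over the list (a variance-reducing baseline). A proper-noun-emphasizing variant could, for example, replace $W$ with a weighted error count,

\[
W_{\text{PN}}(y_i, y^{*}) = \sum_{e \,\in\, E(y_i, y^{*})} \big(1 + \lambda\,\mathbb{1}[e \text{ involves a proper noun}]\big),
\]

where $E(y_i, y^{*})$ is the set of word errors and the hypothetical hyperparameter $\lambda \ge 0$ scales the extra penalty assigned to errors on words tagged as proper nouns, so that training gradients concentrate on hypotheses that misrecognize names.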
