Entity resolution for noisy ASR transcripts

Large-vocabulary, domain-agnostic Automatic Speech Recognition (ASR) systems often mistranscribe domain-specific words and phrases. Because these generic ASR systems are the first component of most voice assistants in production, building Natural Language Understanding (NLU) systems that are robust to their errors is challenging. In this paper, we focus on handling ASR errors in named entities, specifically person names, for a voice-based collaboration assistant. We demonstrate an effective method for resolving person names that have been mistranscribed by black-box ASR systems. The method combines character- and phoneme-based information retrieval techniques with contextual information, and it improves accuracy by 40.8% on our production system. We also provide a live interactive demo to illustrate the nuances of this problem and the effectiveness of our solution.
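
To make the retrieval idea concrete, the following is a minimal sketch of ranking directory names against a mistranscribed name using both character-level and phonetic-level similarity. It is not the system described in this paper: the function names (`char_ngrams`, `crude_phonetic_key`, `resolve_name`), the equal weighting of the two scores, and the sample directory are all assumptions made for illustration. The phonetic key is a deliberately crude stand-in for a real phoneme representation, and contextual signals are omitted entirely.

```python
from collections import Counter

# Hypothetical spelling normalizations; a real system would use a proper
# phonetic encoder or a grapheme-to-phoneme model instead.
_REPLACEMENTS = [("ph", "f"), ("ck", "k"), ("c", "k"), ("q", "k"), ("z", "s"), ("y", "i")]


def char_ngrams(text, n=3):
    """Return a multiset of character n-grams, padded so short names still match."""
    pad = " " * (n - 1)
    padded = pad + text.lower() + pad
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def crude_phonetic_key(text):
    """Build a rough sound-alike key: normalize spellings, drop non-initial
    vowels plus 'h' and 'w', and collapse repeated letters (illustrative only)."""
    keys = []
    for word in text.lower().split():
        for old, new in _REPLACEMENTS:
            word = word.replace(old, new)
        kept = []
        for i, ch in enumerate(c for c in word if c.isalpha()):
            if i > 0 and ch in "aeiouhw":
                continue
            if kept and kept[-1] == ch:
                continue
            kept.append(ch)
        keys.append("".join(kept))
    return " ".join(keys)


def dice(a, b):
    """Dice coefficient between two n-gram multisets, in the range [0, 1]."""
    total = sum(a.values()) + sum(b.values())
    return 2 * sum((a & b).values()) / total if total else 0.0


def resolve_name(transcribed, directory, char_weight=0.5):
    """Rank directory names by a weighted mix of character and phonetic similarity."""
    query_chars = char_ngrams(transcribed)
    query_phones = char_ngrams(crude_phonetic_key(transcribed), n=2)
    scored = []
    for name in directory:
        char_score = dice(query_chars, char_ngrams(name))
        phone_score = dice(query_phones, char_ngrams(crude_phonetic_key(name), n=2))
        scored.append((name, char_weight * char_score + (1 - char_weight) * phone_score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical directory and ASR output, for illustration only.
directory = ["Kristen Smith", "Christian Schmidt", "John Carter", "Priya Natarajan"]
for name, score in resolve_name("kristin smyth", directory):
    print(f"{name}: {score:.3f}")
```

Combining the two views is the point of the sketch: the character n-grams reward near-identical spellings, while the phonetic key lets names that sound alike but are spelled differently still score well. In a production setting the candidate list would typically come from an index rather than a linear scan.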
