Fast and Robust Unsupervised Contextual Biasing for Speech Recognition

Automatic speech recognition (ASR) is becoming a ubiquitous technology. Although its accuracy is approaching human level under certain settings, one area that can still be improved is the incorporation of user-specific information or context to bias its predictions. A common framework is to dynamically construct a small language model from the provided contextual mini corpus and interpolate its score with that of the main language model during decoding. Here we propose an alternative approach that does not require an explicit contextual language model. Instead, we derive a bias score for every word in the system vocabulary from the training corpus. The method is unique in three respects: 1) it does not require meta-data or class-label annotation for the context or the training corpus; 2) the bias score is proportional to the word's log-probability, so it not only biases recognition toward the provided context but is also robust against irrelevant context (e.g., context that the user mis-specified, or cases where a tight scope is hard to define); 3) the bias scores for the entire vocabulary are pre-determined during the training stage, thereby eliminating computationally expensive language model construction during inference. We show significant improvement in recognition accuracy when relevant context is available. We also demonstrate that the proposed method exhibits high tolerance to false-triggering errors in the presence of irrelevant context.
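
The idea of a precomputed, per-word bias score can be illustrated with a minimal sketch. The abstract states only that the score is proportional to the word's log-probability; everything else here is an assumption made for illustration: the unigram estimate, the negative proportionality constant (so that rarer words receive a larger boost), the `scale` parameter, and the hypothetical function names `train_bias_scores` and `biased_score`. It is not the authors' implementation.

```python
import math
from collections import Counter


def train_bias_scores(corpus_sentences, scale=0.5):
    """Precompute one bias score per vocabulary word at training time.

    Assumption: bias(w) = -scale * log p(w), with p(w) a unigram estimate.
    The abstract only says the score is proportional to the log-probability;
    the sign and the value of `scale` are illustrative choices.
    """
    counts = Counter(w for s in corpus_sentences for w in s.split())
    total = sum(counts.values())
    return {w: -scale * math.log(c / total) for w, c in counts.items()}


def biased_score(word, base_log_score, context_words, bias_scores):
    """Add the precomputed bias only when the word appears in the context."""
    if word in context_words:
        return base_log_score + bias_scores.get(word, 0.0)
    return base_log_score


# Toy training corpus and user-provided context (both hypothetical).
corpus = ["call john now", "call the office", "call the doctor now"]
bias = train_bias_scores(corpus)
context = {"john", "the"}

for w in ("john", "the", "office"):
    print(w, round(biased_score(w, -5.0, context, bias), 3))
```

Under these assumptions, a rare context word such as "john" receives a larger boost than a frequent one such as "the", and a word outside the context is left unchanged. No language model is built at inference time; the lookup table is computed once from the training corpus.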
