Bringing contextual information to Google speech recognition

In automatic speech recognition (ASR) on mobile devices, what a user says often depends strongly on the particular context they are in. The n-grams relevant to that context are frequently not known in advance; the context may depend on, for example, the current dialog state, the options presented to the user, the conversation topic, or the user's location. Recognizing sentences that include these n-grams can be challenging, as they are often poorly represented in the language model (LM) or even contain out-of-vocabulary (OOV) words. In this paper, we propose a solution that uses contextual information to improve speech recognition accuracy. We employ an on-the-fly rescoring mechanism that adjusts, during decoding, the LM weights of a small set of n-grams relevant to the current context. Our solution handles OOV words, efficiently combines multiple sources of context, and even allows biasing of class-based language models. We show significant recognition accuracy improvements on several datasets, using various types of context, without negatively impacting the overall system. The improvements are obtained in both offline and live experiments.
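The core idea of on-the-fly rescoring can be sketched as follows: when a decoding hypothesis is extended with an n-gram that appears in the contextual biasing set, its LM cost is adjusted downward, while all other hypotheses are left untouched. The function name, the interpolation scheme, and the example weights below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of on-the-fly contextual biasing during decoding.
# `biasing_model` maps n-grams (tuples of words) to a contextual
# negative-log-probability cost; everything here is an assumed
# simplification of the mechanism described in the abstract.

def biased_lm_cost(ngram, base_cost, biasing_model, alpha=0.5):
    """Return an adjusted negative-log LM cost for `ngram`.

    If the n-gram is in the contextual biasing set, interpolate the
    base LM cost toward the (typically lower) contextual cost, taking
    the minimum so biasing never hurts a hypothesis. N-grams outside
    the set keep their base cost, leaving the rest of the search
    space unaffected.
    """
    if ngram in biasing_model:
        bias_cost = biasing_model[ngram]
        # Log-linear interpolation between base LM and contextual LM.
        return min(base_cost, (1 - alpha) * base_cost + alpha * bias_cost)
    return base_cost


# Example: boost "turn on" in a dialog state offering that option.
biasing = {("turn", "on"): 1.0}  # assumed precomputed -log P
print(biased_lm_cost(("turn", "on"), 4.0, biasing))   # → 2.5 (boosted)
print(biased_lm_cost(("turn", "off"), 4.0, biasing))  # → 4.0 (unchanged)
```

Because only the small set of context-relevant n-grams is touched at decode time, this kind of rescoring is cheap enough to run per utterance with a freshly supplied context.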
