Contextual ASR Adaptation for Conversational Agents

Statistical language models (LMs) play a key role in the Automatic Speech Recognition (ASR) systems used by conversational agents. These ASR systems should provide high accuracy across a variety of speaking styles, domains, vocabularies and argots. In this paper, we present a DNN-based method that adapts the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal, context-dependent set of LM interpolation weights. We show that this framework for contextual adaptation provides accuracy improvements under different mixture-LM partitions that are relevant for (1) goal-oriented conversational agents, where it is natural to partition the data by the requested application, and (2) non-goal-oriented conversational agents, where the data can be partitioned using topic labels predicted by a topic classifier. We obtain a relative WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass decoding framework, over an unadapted model. We also show up to a 15% relative improvement in recognizing named entities, which is of significant value for conversational ASR systems.
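The abstract does not spell out an implementation; as a minimal sketch of the general idea it describes (a small DNN that maps a context representation to a softmax over component-LM interpolation weights, which are then used to mix the component LM probabilities), the following PyTorch code may help. All class names, dimensions, and numeric values here are hypothetical illustrations, not the authors' actual model.

```python
import torch
import torch.nn as nn


class InterpolationWeightPredictor(nn.Module):
    """Hypothetical feed-forward network mapping a context embedding to
    mixture weights over K component LMs (softmax ensures they sum to 1)."""

    def __init__(self, context_dim: int, num_lms: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_lms),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Context-dependent interpolation weights w_1..w_K.
        return torch.softmax(self.net(context), dim=-1)


def mixture_log_prob(component_log_probs: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Log-probability under the interpolated LM:
    log sum_k w_k * P_k(word | history), computed stably with logsumexp."""
    return torch.logsumexp(torch.log(weights) + component_log_probs, dim=-1)


# Toy usage: 3 component LMs, a 16-dimensional context feature vector
# (e.g. topic or application features for the current interaction).
predictor = InterpolationWeightPredictor(context_dim=16, num_lms=3)
context = torch.randn(1, 16)
weights = predictor(context)                                   # shape (1, 3)
lm_scores = torch.log(torch.tensor([[0.010, 0.002, 0.030]]))   # log P_k(word | history)
print(mixture_log_prob(lm_scores, weights))
```

In such a setup the weight predictor could be trained to minimize the perplexity of the interpolated mixture on held-out, context-labeled utterances, so that each interaction receives its own interpolation weights rather than a single static set.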
