Contextual ASR Adaptation for Conversational Agents

Statistical language models (LMs) play a key role in the Automatic Speech Recognition (ASR) systems used by conversational agents. These ASR systems should provide high accuracy across a variety of speaking styles, domains, vocabularies and argots. In this paper, we present a DNN-based method that adapts the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal, context-dependent set of LM interpolation weights. We show that this framework for contextual adaptation provides accuracy improvements under different mixture-LM partitions that are relevant for (1) goal-oriented conversational agents, where it is natural to partition the data by the requested application, and (2) non-goal-oriented conversational agents, where the data can be partitioned using topic labels predicted by a topic classifier. We obtain a relative WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass decoding framework, over an unadapted model. We also show up to a 15% relative improvement in recognizing named entities, which is of significant value for conversational ASR systems.
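The abstract does not spell out an implementation; as a minimal sketch of the general idea it describes (a small DNN that maps a context representation to a softmax over component-LM interpolation weights, which are then used to mix the component LM probabilities), the following PyTorch code may help. All class names, dimensions, and numeric values here are hypothetical illustrations, not the authors' actual model.

```python
import torch
import torch.nn as nn


class InterpolationWeightPredictor(nn.Module):
    """Hypothetical feed-forward network mapping a context embedding to
    mixture weights over K component LMs (softmax ensures they sum to 1)."""

    def __init__(self, context_dim: int, num_lms: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_lms),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Context-dependent interpolation weights w_1..w_K.
        return torch.softmax(self.net(context), dim=-1)


def mixture_log_prob(component_log_probs: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Log-probability under the interpolated LM:
    log sum_k w_k * P_k(word | history), computed stably with logsumexp."""
    return torch.logsumexp(torch.log(weights) + component_log_probs, dim=-1)


# Toy usage: 3 component LMs, a 16-dimensional context feature vector
# (e.g. topic or application features for the current interaction).
predictor = InterpolationWeightPredictor(context_dim=16, num_lms=3)
context = torch.randn(1, 16)
weights = predictor(context)                                   # shape (1, 3)
lm_scores = torch.log(torch.tensor([[0.010, 0.002, 0.030]]))   # log P_k(word | history)
print(mixture_log_prob(lm_scores, weights))
```

In such a setup the weight predictor could be trained to minimize the perplexity of the interpolated mixture on held-out, context-labeled utterances, so that each interaction receives its own interpolation weights rather than a single static set.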
