Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors

Speech-based virtual assistants, such as Amazon Alexa, Google Assistant, and Apple Siri, typically convert users’ audio signals to text through automatic speech recognition (ASR) and feed the text to downstream dialog models for natural language understanding and response generation. ASR output is error-prone; however, the downstream dialog models are often trained on error-free text, which makes them sensitive to ASR errors at inference time. To bridge this gap and make dialog models more robust to ASR errors, we use an ASR error simulator to inject noise into the error-free text and then train the dialog models on the augmented data. Compared to other approaches for handling ASR errors, such as using ASR lattices or end-to-end methods, our data augmentation approach requires no modification to the ASR or downstream dialog models, and it introduces no additional latency at inference time. We perform extensive experiments on benchmark data and show that our approach improves the performance of downstream dialog models in the presence of ASR errors, and that it is particularly effective in low-resource situations where model size is constrained or training data is scarce.
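
To make the augmentation idea concrete, the following is a minimal sketch of word-level noise injection, not the simulator used in this work. It assumes a small hand-written confusion table (`CONFUSIONS`), a filler list (`FILLERS`), and utterance-level labels such as intents; all names, the three error operations, and the default `error_rate` are illustrative assumptions. A realistic simulator would typically derive confusions from ASR output or phonetic similarity, and token-level labels (for example slot tags) would need to be realigned after insertions and deletions.

```python
import random

# Hypothetical confusion table: each clean word maps to plausible
# misrecognitions. These entries are invented for illustration.
CONFUSIONS = {
    "two": ["to", "too"],
    "for": ["four"],
    "flight": ["fright", "light"],
    "book": ["look"],
}
# Hypothetical short words that a recognizer might spuriously insert.
FILLERS = ["uh", "the", "a"]


def simulate_asr_errors(text, error_rate=0.2, rng=None):
    """Return a noisy copy of `text` with simulated word-level errors."""
    rng = rng or random.Random()
    noisy = []
    for word in text.split():
        # Leave the word untouched with probability 1 - error_rate.
        if rng.random() >= error_rate:
            noisy.append(word)
            continue
        op = rng.choice(["substitute", "delete", "insert"])
        if op == "substitute" and word in CONFUSIONS:
            noisy.append(rng.choice(CONFUSIONS[word]))
        elif op == "delete":
            continue  # drop the word
        elif op == "insert":
            noisy.extend([word, rng.choice(FILLERS)])
        else:
            noisy.append(word)  # no known confusion: keep the word
    return " ".join(noisy)


def augment(examples, copies=1, error_rate=0.2, seed=0):
    """Keep the clean (text, label) pairs and append noisy copies."""
    rng = random.Random(seed)
    augmented = list(examples)
    for text, label in examples:
        for _ in range(copies):
            noisy_text = simulate_asr_errors(text, error_rate, rng)
            augmented.append((noisy_text, label))
    return augmented


if __name__ == "__main__":
    data = [("book a flight for two", "book_flight")]
    for text, label in augment(data, copies=3, error_rate=0.3):
        print(label, "|", text)
```

Because the noise is applied only to the training text, a sketch like this leaves the recognizer and the dialog model architecture unchanged and adds no work at inference time, which is the property the abstract emphasizes.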
