Towards an ASR Error Robust Spoken Language Understanding System

A modern Spoken Language Understanding (SLU) system typically comprises two sub-systems: Automatic Speech Recognition (ASR), which transcribes the voice signal into text, and Natural Language Understanding (NLU), which performs intent classification and slot filling on that text. In practice, this decoupled ASR/NLU design enables fast model iteration for both components. However, it also makes the downstream NLU susceptible to errors from the upstream ASR, causing significant performance degradation. Handling such errors is therefore a major opportunity to improve overall SLU performance. In this work, we first propose a general evaluation criterion that requires an ASR-error-robust model to perform well on both the human transcription and the ASR hypothesis. We then introduce robustness training techniques for both the classification and NER tasks. Experimental results on two datasets show that our proposed approaches improve model robustness to ASR errors on both tasks.
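The dual evaluation criterion above can be made concrete with a minimal sketch: score the same model on paired clean transcriptions and ASR hypotheses, and report both accuracies. All names below (`evaluate_dual`, the toy model, the example data) are illustrative assumptions, not taken from the paper.

```python
# Sketch of the dual evaluation criterion: a robust SLU model should
# perform well on BOTH the human transcription and the ASR hypothesis
# of each utterance. Names and data here are illustrative only.

def evaluate_dual(model, examples):
    """examples: list of (transcript, asr_hypothesis, gold_label) triples."""
    correct_transcript = 0
    correct_hypothesis = 0
    for transcript, hypothesis, gold in examples:
        if model(transcript) == gold:
            correct_transcript += 1
        if model(hypothesis) == gold:
            correct_hypothesis += 1
    n = len(examples)
    # Report accuracy on clean text and on noisy ASR output separately.
    return correct_transcript / n, correct_hypothesis / n

# Toy keyword-matching "model" standing in for a real intent classifier.
toy_model = lambda text: "play_music" if "play" in text else "other"
data = [
    ("play some jazz", "lay some jazz", "play_music"),  # ASR dropped the 'p'
    ("set an alarm", "set an alarm", "other"),          # recognized correctly
]
acc_transcript, acc_hypothesis = evaluate_dual(toy_model, data)
```

A large gap between the two accuracies (here 1.0 vs. 0.5) signals a model that works on clean text but is brittle to ASR errors, which is exactly what the robustness training techniques aim to close.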
