Uncertainty-Aware Representations for Spoken Question Answering

This paper describes a spoken question answering system that uses the uncertainty in automatic speech recognition (ASR) to mitigate the effect of ASR errors on question answering. Spoken question answering is typically performed by transcribing spoken content with an ASR system and then applying text-based question answering methods to the ASR transcriptions. Question answering on spoken documents is more challenging than question answering on text documents because ASR transcriptions can contain errors, which degrade system performance. In this paper, we propose integrating confusion networks with word confidence scores into an end-to-end neural network-based question answering system that operates on ASR transcriptions. Integration is performed by generating uncertainty-aware embedding representations from the confusion networks. By providing tighter integration of ASR and question answering, the proposed approach improves the F1 score on a question answering task developed for spoken lectures.
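The abstract does not say exactly how a confusion network becomes an embedding sequence. One plausible reading is shown below as a minimal sketch. It assumes each confusion-network bin (a slot of competing word hypotheses) is mapped to the confidence-weighted average of its alternatives' word embeddings. Under that assumption, an uncertain bin yields a blend of its candidates instead of committing to the 1-best word. The weighting scheme, the `<eps>` deletion token, and the function names `bin_embedding` and `uncertainty_aware_embeddings` are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence

# A confusion network is a sequence of bins; each bin maps competing word
# hypotheses to their ASR confidence (posterior) scores.
ConfusionNetwork = List[Dict[str, float]]


def bin_embedding(
    bin_: Dict[str, float],
    embeddings: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Confidence-weighted average of the embeddings of one bin's alternatives.

    Assumptions (illustrative only): scores are renormalised within the bin,
    and words missing from the embedding table (e.g. an "<eps>" deletion arc)
    contribute a zero vector, which shrinks the result toward zero.
    """
    total = sum(bin_.values())
    if total <= 0:
        return [0.0] * dim
    vec = [0.0] * dim
    for word, score in bin_.items():
        weight = score / total
        emb = embeddings.get(word, [0.0] * dim)
        for i in range(dim):
            vec[i] += weight * emb[i]
    return vec


def uncertainty_aware_embeddings(
    cn: ConfusionNetwork,
    embeddings: Dict[str, Sequence[float]],
    dim: int,
) -> List[List[float]]:
    """One vector per bin, in the order the bins appear in the network."""
    return [bin_embedding(b, embeddings, dim) for b in cn]


if __name__ == "__main__":
    # Toy 2-dimensional embedding table and a two-bin confusion network.
    table = {"cat": [1.0, 0.0], "cap": [0.0, 1.0], "sat": [0.5, 0.5]}
    cn = [{"cat": 0.75, "cap": 0.25}, {"sat": 1.0}]
    print(uncertainty_aware_embeddings(cn, table, dim=2))
    # prints [[0.75, 0.25], [0.5, 0.5]]
```

In a neural question answering model, vectors like these would stand in for the embeddings of the 1-best transcript words at the input layer. A confident bin reproduces its top word's embedding almost unchanged, while an ambiguous bin passes its uncertainty on to the downstream layers.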