ConfNet2Seq: Full Length Answer Generation from Spoken Questions

Conversational and task-oriented dialogue systems aim to interact with the user through natural responses over multi-modal interfaces, such as text or speech. The desired responses take the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full-length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer (ASR). To the best of our knowledge, this is the first attempt to generate full-length natural answers from a graph input (a confusion network). We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers, and the corresponding full-length textual answers. With the proposed approach, we achieve performance comparable to that obtained with the best ASR hypothesis.
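
To make the input representation concrete, the following is a minimal sketch of a word confusion network as a plain data structure, assuming the usual formulation in which each position (bin) holds competing word hypotheses with posterior probabilities. The names `EPS`, `best_hypothesis`, `prune`, and the example utterance are hypothetical and chosen for illustration; this is not the paper's implementation or the output format of any particular ASR toolkit.

```python
from typing import List, Tuple

# Assumed representation: a confusion network is a sequence of bins, and each
# bin is a list of (word, posterior) alternatives for one position.
ConfusionNetwork = List[List[Tuple[str, float]]]

EPS = "<eps>"  # hypothetical marker for "no word at this position"


def best_hypothesis(confnet: ConfusionNetwork) -> List[str]:
    """Pick the highest-posterior word in every bin, dropping epsilon arcs."""
    words = []
    for alternatives in confnet:
        word, _ = max(alternatives, key=lambda arc: arc[1])
        if word != EPS:
            words.append(word)
    return words


def prune(confnet: ConfusionNetwork, top_k: int) -> ConfusionNetwork:
    """Keep the top_k alternatives per bin and renormalize their posteriors."""
    pruned = []
    for alternatives in confnet:
        kept = sorted(alternatives, key=lambda arc: arc[1], reverse=True)[:top_k]
        total = sum(p for _, p in kept)
        pruned.append([(w, p / total) for w, p in kept])
    return pruned


# Illustrative (made-up) network for a spoken question.
example: ConfusionNetwork = [
    [("who", 0.9), ("how", 0.1)],
    [("wrote", 0.6), ("rode", 0.3), (EPS, 0.1)],
    [("hamlet", 0.7), ("omelet", 0.3)],
]

if __name__ == "__main__":
    print(best_hypothesis(example))
    print(prune(example, top_k=2))
```

Taking only `best_hypothesis` corresponds to using the best ASR hypothesis as a baseline, whereas a model that consumes the full (or pruned) network can also see the lower-ranked alternatives in each bin.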
