Machine Comprehension of Spoken Content: TOEFL Listening Test and Spoken SQuAD

A user can easily scan through text, but this is not the case for spoken content, which cannot be directly displayed on screen. As a result, accessing large collections of spoken content is far more difficult and time-consuming than accessing text. It would therefore be helpful to develop machines that understand spoken content. In this paper, we propose two new tasks for machine comprehension of spoken content. The first is a TOEFL listening comprehension test, a challenging academic English examination for non-native English learners. We show that the proposed model outperforms naive approaches and other neural-network-based models by exploiting the hierarchical structure of natural language and the selective power of attention mechanisms. For the second task, Spoken SQuAD, we find that speech recognition errors severely impair machine comprehension, and we propose the use of subword units to mitigate the impact of these errors.
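To make the hierarchical-attention idea concrete, the following minimal Python/NumPy sketch shows the general pattern described above: question-guided attention first pools the word vectors of each sentence into a sentence vector, and then pools the sentence vectors into a single story representation. This is an illustrative sketch only, not the paper's actual architecture; the function names, embedding dimension, and plain dot-product attention are assumptions made for this example.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, vectors):
    # Dot-product attention: weight each row of `vectors` by its similarity
    # to `query`, then return the weighted sum (shape: (d,)).
    weights = softmax(vectors @ query)
    return weights @ vectors

def hierarchical_read(question_vec, sentences):
    # `sentences` is a list of (n_words, d) arrays of word embeddings.
    # Word level: collapse each sentence into one vector, guided by the question.
    sentence_vecs = np.stack([attend(question_vec, s) for s in sentences])
    # Sentence level: attend over sentence vectors to form the story representation.
    return attend(question_vec, sentence_vecs)

# Toy usage with random embeddings of dimension d = 8.
rng = np.random.default_rng(0)
question = rng.normal(size=8)
story = [rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(4, 8))]
print(hierarchical_read(question, story).shape)  # -> (8,)

The two attention stages mirror the claim in the abstract: the word-level stage selects the relevant words inside each sentence, and the sentence-level stage selects the relevant sentences for the question, which is what gives a hierarchical model its advantage over flat baselines.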
