Weakly Supervised Pre-Training for Multi-Hop Retriever

In multi-hop QA, answering a complex question requires iterative document retrieval to find the entity that the question leaves implicit. The main steps of this process are sub-question detection, document retrieval for the sub-question, and generation of a new query for the final document retrieval. However, building a dataset of complex questions paired with their sub-questions and corresponding documents requires costly human annotation. To address this issue, we propose a new method for weakly supervised multi-hop retriever pre-training that requires no human effort. Our method includes 1) a pre-training task for generating vector representations of complex questions, 2) a scalable data generation method that produces nested question/sub-question structures as weak supervision for pre-training, and 3) a pre-training model structure based on dense encoders. We conduct experiments comparing our pre-trained retriever with several state-of-the-art models on end-to-end multi-hop QA as well as document retrieval. The results show that our pre-trained retriever is effective and remains robust under limited data and computational resources.
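To make the retrieval loop described above concrete, here is a minimal sketch of iterative (two-hop) dense retrieval: encode the question, retrieve first-hop evidence, reformulate the query with that evidence, and retrieve again. The hash-based `encode` is a toy stand-in for a learned dense encoder (not the paper's pre-trained model), and helper names such as `reformulate` are illustrative assumptions rather than the authors' API.

```python
# Minimal two-hop dense-retrieval sketch (illustrative only).
import hashlib
import numpy as np

DIM = 64

def encode(text: str) -> np.ndarray:
    """Toy stand-in for a dense encoder: deterministic unit vector per string."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return vec / np.linalg.norm(vec)

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the top-k documents by inner-product similarity."""
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]

def reformulate(question: str, first_hop_doc: str) -> str:
    """Illustrative query reformulation: condition the next hop on the
    evidence retrieved so far (a real system would learn this step)."""
    return question + " [SEP] " + first_hop_doc

# Tiny corpus and a two-hop question.
docs = [
    "Kill Bill was directed by Quentin Tarantino.",
    "Quentin Tarantino was born in Knoxville, Tennessee.",
    "The Eiffel Tower is located in Paris.",
]
doc_vecs = np.stack([encode(d) for d in docs])
question = "Where was the director of Kill Bill born?"

# Hop 1: retrieve evidence for the implicit sub-question.
hop1 = retrieve(encode(question), doc_vecs)[0]
# Hop 2: reformulate the query with the first-hop document and retrieve again.
hop2 = retrieve(encode(reformulate(question, docs[hop1])), doc_vecs)[0]
print("hop 1:", docs[hop1])
print("hop 2:", docs[hop2])
```

With a trained bi-encoder in place of the toy `encode`, the same loop yields the sub-question-then-final-query behavior the abstract describes; here the retrieved documents are arbitrary because the vectors carry no semantics.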
