Learning Dense Representations of Phrases at Scale

Open-domain question answering can be reformulated as a phrase retrieval problem, without the need for processing documents on demand during inference (Seo et al., 2019). However, current phrase retrieval models heavily depend on sparse representations and still underperform retriever-reader approaches. In this work, we show for the first time that we can learn dense phrase representations alone that achieve much stronger performance in open-domain QA. Our approach includes (1) learning query-agnostic phrase representations via question generation and distillation; (2) novel negative-sampling methods for global normalization; and (3) query-side fine-tuning for transfer learning. On five popular QA datasets, our model DensePhrases improves over previous phrase retrieval models by 15%–25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. Our model is easy to parallelize thanks to purely dense representations and processes more than 10 questions per second on CPUs. Finally, we directly use our pre-indexed dense phrase representations for two slot filling tasks, showing the promise of utilizing DensePhrases as a dense knowledge base for downstream tasks.
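
At inference time the pipeline is simple: every phrase in the corpus is pre-encoded into a dense vector offline, and answering a question reduces to maximum inner product search (MIPS) between the question vector and the phrase index. Below is a minimal sketch of that retrieval step using FAISS (Johnson et al., 2017); the random "encoders", vector dimensionality, and toy phrase list are placeholders for illustration only, not the actual DensePhrases implementation.

import numpy as np
import faiss  # billion-scale similarity search library (Johnson et al., 2017)

# Placeholder dimensionality and phrase list; in practice a trained
# query-agnostic phrase encoder would produce these vectors offline.
d = 128
phrases = ["Barack Obama", "August 4, 1961", "Honolulu, Hawaii"]
rng = np.random.default_rng(0)
phrase_vecs = rng.standard_normal((len(phrases), d)).astype("float32")

# Offline: build a MIPS index over all phrase vectors (inner-product index).
index = faiss.IndexFlatIP(d)
index.add(phrase_vecs)

# Online: encode only the question and retrieve the top-k phrases directly,
# with no on-demand document reading.
question_vec = rng.standard_normal((1, d)).astype("float32")
scores, ids = index.search(question_vec, 2)
for score, i in zip(scores[0], ids[0]):
    print(phrases[i], float(score))

Because the phrase index is built once offline, per-question cost is dominated by a single question encoding plus MIPS, which is what makes throughputs above 10 questions per second on CPUs attainable.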

[1] Ming-Wei Chang et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.

[2] Nicola De Cao et al. KILT: a Benchmark for Knowledge Intensive Language Tasks, 2020, arXiv.

[3] Ming-Wei Chang et al. REALM: Retrieval-Augmented Language Model Pre-Training, 2020, ICML.

[4] Niranjan Balasubramanian et al. DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering, 2020, ACL.

[5] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network, 2015, arXiv.

[6] Jimmy J. Lin et al. End-to-End Open-Domain Question Answering with BERTserini, 2019, NAACL.

[7] Danqi Chen et al. A Discrete Hard EM Approach for Weakly Supervised Question Answering, 2019, EMNLP.

[8] Charlotte Pasqual et al. Delaying Interaction Layers in Transformer-based Encoders for Efficient Open Domain Question Answering, 2020, arXiv.

[9] Petr Baudis et al. Modeling of the Question Answering Task in the YodaQA System, 2015, CLEF.

[10] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[11] Ming-Wei Chang et al. Latent Retrieval for Weakly Supervised Open Domain Question Answering, 2019, ACL.

[12] Danqi Chen et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.

[13] Christophe Gravier et al. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples, 2018, LREC.

[14] Jennifer Chu-Carroll et al. Building Watson: An Overview of the DeepQA Project, 2010, AI Magazine.

[15] Kaiming He et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2020, CVPR.

[16] Jian Zhang et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[17] Jeff Johnson et al. Billion-Scale Similarity Search with GPUs, 2017, IEEE Transactions on Big Data.

[18] Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.

[19] Omer Levy et al. Zero-Shot Relation Extraction via Reading Comprehension, 2017, CoNLL.

[20] Colin Raffel et al. How Much Knowledge Can You Pack Into the Parameters of a Language Model?, 2020, EMNLP.

[21] Omer Levy et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019, TACL.

[22] Jaewoo Kang et al. Contextualized Sparse Representations for Real-Time Open-Domain Question Answering, 2020, ACL.

[23] Ali Farhadi et al. Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension, 2018, EMNLP.

[24] Ellen M. Voorhees et al. The TREC-8 Question Answering Track Report, 1999, TREC.

[25] Ali Farhadi et al. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index, 2019, ACL.

[26] Richard Socher et al. Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering, 2019, ICLR.

[27] Matthew Henderson et al. Efficient Natural Language Response Suggestion for Smart Reply, 2017, arXiv.

[28] Eunsol Choi et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.

[29] Matei Zaharia et al. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, 2020, SIGIR.

[30] Lucian Vlad Lita et al. tRuEcasIng, 2003, ACL.

[31] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[32] Kenton Lee et al. Learning Recurrent Span Representations for Extractive Question Answering, 2016, arXiv.

[33] Jason Weston et al. Reading Wikipedia to Answer Open-Domain Questions, 2017, ACL.

[34] Mark Andrew Greenwood et al. Open-domain question answering, 2005.

[35] Fabio Petroni et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020, NeurIPS.

[36] Edouard Grave et al. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering, 2020, EACL.

[37] Jason Weston et al. Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring, 2020, ICLR.

[38] Ramesh Nallapati et al. Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering, 2019, EMNLP.

[39] Andrew Chou et al. Semantic Parsing on Freebase from Question-Answer Pairs, 2013, EMNLP.