KILT: a Benchmark for Knowledge Intensive Language Tasks

Challenging problems such as open-domain question answering, fact checking, slot filling, and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult because each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the reuse of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering, and dialogue, and yielding competitive results on entity linking and slot filling by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.
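To make the strong baseline concrete: a dense passage retriever supplies evidence from a shared Wikipedia index, and a seq2seq model generates the answer conditioned on it. Below is a minimal, hedged sketch of that recipe (not the authors' exact KILT pipeline), using Hugging Face's RAG implementation, which couples a DPR-style dense index with a BART generator. The checkpoint name `facebook/rag-sequence-nq` is a real Hub identifier; the question string is purely illustrative.

```python
# A minimal sketch of a dense-retriever + seq2seq baseline in the spirit of
# the one KILT evaluates (an illustration, not the paper's exact pipeline).
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")

# use_dummy_dataset=True loads a tiny toy index so the example runs without
# downloading the full Wikipedia dense index; drop it for real experiments.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# Illustrative query: the retriever fetches passages, BART generates an answer.
inputs = tokenizer("who wrote the novel nineteen eighty-four?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

Because retrieval and generation share one dense index of the same Wikipedia snapshot, the retrieved passage identifiers double as provenance, which is what lets a single architecture of this kind be scored on both downstream accuracy and evidence quality across all KILT tasks.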
