Importance sampling for unbiased on-demand evaluation of knowledge base population

Knowledge base population (KBP) systems take in a large document corpus and extract entities and their relations. Thus far, KBP evaluation has relied on judgements on the pooled predictions of existing systems. We show that this evaluation is problematic: when a new system predicts a previously unseen relation, it is penalized even if it is correct. This leads to significant bias against new systems, which counterproductively discourages innovation in the field. Our first contribution is a new importance-sampling-based evaluation that corrects for this bias by annotating a new system’s predictions on demand via crowdsourcing. Using data from the 2015 TAC KBP task, we show that this eliminates bias and reduces variance. Our second contribution is an implementation of our method, made publicly available as an online KBP evaluation service. We pilot the service by testing diverse state-of-the-art systems on the TAC KBP 2016 corpus and obtain accurate scores in a cost-effective manner.
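To make the importance-sampling idea concrete, the sketch below shows how correctness labels collected for a sample of a system's predictions, drawn from a non-uniform proposal distribution, can be reweighted into an unbiased estimate of that system's precision. This is a minimal illustration under assumed interfaces, not the estimator or API of the paper's evaluation service: the names `estimate_precision`, `proposal_scores`, and `annotate` are hypothetical stand-ins (with `annotate` playing the role of a crowdsourced judgement).

```python
import random

def estimate_precision(predictions, proposal_scores, annotate, sample_size=200):
    """Unbiased importance-sampling estimate of a system's precision.

    predictions     : list of (subject, relation, object) tuples output by the system
    proposal_scores : dict giving each prediction a positive sampling score
                      (e.g. higher for predictions not already covered by the pool)
    annotate        : callable returning 1.0 if a prediction is judged correct,
                      0.0 otherwise (a stand-in for an on-demand crowdsourcing call)
    """
    items = list(predictions)
    n = len(items)

    # Normalise the scores into a proper proposal distribution q over predictions.
    z = sum(proposal_scores[p] for p in items)
    q = {p: proposal_scores[p] / z for p in items}

    # Draw annotation candidates i.i.d. from q rather than uniformly.
    sample = random.choices(items, weights=[q[p] for p in items], k=sample_size)

    # Precision is the mean correctness under the uniform target distribution
    # p(x) = 1/n; reweighting each label by p(x)/q(x) keeps the estimate unbiased.
    total = 0.0
    for pred in sample:
        total += ((1.0 / n) / q[pred]) * annotate(pred)
    return total / sample_size
```

Because each label is reweighted against the uniform target distribution over the system's own predictions, the estimate stays unbiased even when the proposal concentrates annotation effort on predictions shared with previously judged systems, which is what lets on-demand annotation reduce cost without reintroducing pooling bias.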
