Big Data versus the Crowd: Looking for Relationships in All the Right Places

Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have turned to two newly available sources of less expensive (but potentially lower-quality) labeled data: distant supervision and crowdsourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of the resulting extractors. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling each source. We use corpus sizes of up to 100 million documents and tens of thousands of crowdsourced labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but smaller, impact on precision and recall.
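To make the distant-supervision side of the comparison concrete, below is a minimal sketch of the standard labeling heuristic (in the style of Mintz et al., 2009): any sentence that mentions both entities of a known knowledge-base pair is treated as a noisy positive example of that pair's relation. The knowledge base, corpus, and function names here are illustrative assumptions, not artifacts of the paper.

```python
# Minimal sketch of the distant-supervision labeling heuristic:
# a sentence mentioning both entities of a known KB pair is taken
# as a (noisy) positive training example for that relation.
# KB and CORPUS below are hypothetical toy data.

KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Larry Page", "Google"): "founder_of",
}

CORPUS = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Larry Page co-founded Google in 1998.",
    "Barack Obama visited Honolulu last week.",  # false match: not born_in
]

def distant_supervision(kb, corpus):
    """Yield (sentence, entity_pair, relation) training triples."""
    for sentence in corpus:
        for (e1, e2), relation in kb.items():
            # Naive substring matching stands in for entity linking.
            if e1 in sentence and e2 in sentence:
                yield sentence, (e1, e2), relation

for example in distant_supervision(KB, CORPUS):
    print(example)
```

The third corpus sentence illustrates why this source of labels is "potentially lower quality": it mentions both entities but does not express the born_in relation, so the heuristic generates a noisy example. This is exactly the kind of noise that human feedback (crowdsourced corrections) is meant to reduce.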
