Combining information extraction and human computing for crowdsourced knowledge acquisition

Automatic information extraction (IE) enables the construction of very large knowledge bases (KBs), with relational facts on millions of entities drawn from text corpora and Web sources. However, such KBs contain errors and are far from complete. This motivates exploiting human intelligence and knowledge through crowd-based human computing (HC), both to assess the validity of facts and to gather additional knowledge. This paper presents a novel system architecture, called Higgins, which shows how to effectively integrate an IE engine and an HC engine. Higgins generates game questions in which players choose or fill in the missing relation of a subject-relation-object triple. To generate multiple-choice answer candidates, we have constructed a large dictionary of entity names and relational phrases, and have developed specifically designed statistical language models for phrase relatedness. To this end, we combine semantic resources such as WordNet, ConceptNet, and others with statistics derived from a large Web corpus. We demonstrate the effectiveness of Higgins for knowledge acquisition by crowdsourced gathering of relationships between characters in narrative descriptions of movies and books.
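As a rough illustration of how multiple-choice candidates for a missing relation might be ranked, the following minimal sketch scores relational phrases against a text snippet with a smoothed unigram language model. It is not the Higgins implementation: the function names (`unigram_model`, `phrase_score`, `make_question`), the Jelinek-Mercer smoothing weight `lam`, the floor probability for unseen words, and the toy inputs are all assumptions made for illustration, and the actual system additionally draws on resources such as WordNet and ConceptNet.

```python
import math
from collections import Counter


def unigram_model(text):
    """Return a maximum-likelihood unigram distribution over whitespace tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def phrase_score(phrase, context_model, background_model, lam=0.7):
    """Average log-probability of the phrase's words under a mixture of the
    context model and a background model (Jelinek-Mercer style smoothing).
    The weight lam and the 1e-6 floor for unseen words are assumed values."""
    words = phrase.lower().split()
    if not words:
        return float("-inf")
    total = 0.0
    for word in words:
        p = (lam * context_model.get(word, 0.0)
             + (1 - lam) * background_model.get(word, 1e-6))
        total += math.log(p)
    return total / len(words)  # normalize so longer phrases are not penalized


def make_question(subject, obj, context, candidate_phrases, background_text, k=3):
    """Build a hypothetical game question: the relation is left blank and the
    k highest-scoring relational phrases are offered as choices."""
    ctx = unigram_model(context)
    bg = unigram_model(background_text)
    ranked = sorted(candidate_phrases,
                    key=lambda p: phrase_score(p, ctx, bg),
                    reverse=True)
    return {"subject": subject, "object": obj, "choices": ranked[:k]}
```

A call would pass a subject, an object, the snippet in which both occur, a list of candidate relational phrases, and some background text; the returned dictionary holds the `k` best-scoring phrases as answer choices. In a real setting the background model would be estimated from a large corpus rather than a short string.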
