Towards a language-independent solution: Knowledge base completion by searching the Web and deriving language pattern

Abstract: Knowledge bases (KBs) such as Freebase and YAGO are far from complete, and the problem is more severe for non-English KBs, such as Chinese KBs. In this paper, we present a language-independent framework that tackles the slot-filling task by searching the Web with high-precision queries and deriving lightweight extraction patterns. The patterns are based on string matching; because they do not rely on complex NLP resources, which may be unavailable for some languages, they are largely language-independent. We use a traditional bootstrapping approach for extraction, and we introduce a novel way to suppress the noise associated with distant supervision: a pseudo-testing method that validates the patterns derived from different sentences. Experiments show that our framework achieves encouraging results.
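The abstract does not specify the pattern format or the pseudo-testing procedure, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a pattern is the literal text between a subject and its known value plus the single character that follows the value, and that pseudo-testing means applying a pattern to sentences about other entities whose values are already known and measuring how often the extraction matches. All function names, the pattern representation, and the example sentences are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: string-matching patterns validated by pseudo-testing.
# Only plain string operations are used (no tokenizer, tagger, or parser).

def derive_pattern(sentence, subject, value):
    """Derive (middle, right) from a sentence that contains subject then value.

    middle is the text between subject and value; right is the one character
    that follows the value. Returns None if no usable pattern can be derived.
    """
    s = sentence.find(subject)
    if s < 0:
        return None
    mid_start = s + len(subject)
    v = sentence.find(value, mid_start)
    if v < 0:
        return None
    middle = sentence[mid_start:v]
    end = v + len(value)
    right = sentence[end:end + 1]
    if not middle or not right:
        return None
    return (middle, right)


def apply_pattern(pattern, sentence, subject):
    """Extract a candidate value for subject, or None if the pattern does not fire."""
    middle, right = pattern
    anchor = subject + middle
    s = sentence.find(anchor)
    if s < 0:
        return None
    start = s + len(anchor)
    e = sentence.find(right, start)
    if e <= start:  # delimiter missing, or the extraction would be empty
        return None
    return sentence[start:e]


def pseudo_test(pattern, known_facts):
    """Precision of a pattern over (subject, value, sentence) triples with known values."""
    attempted = correct = 0
    for subject, value, sentence in known_facts:
        extracted = apply_pattern(pattern, sentence, subject)
        if extracted is None:
            continue
        attempted += 1
        correct += extracted == value
    return correct / attempted if attempted else 0.0


# Known facts used only for validation (assumed example data).
known_facts = [
    ("Alan Turing", "London", "Alan Turing was born in London, England."),
    ("Alan Turing", "London", "Alan Turing visited Princeton, then returned."),
    ("Ada Lovelace", "London", "Ada Lovelace was born in London, in 1815."),
]

# Distant supervision yields both a useful and a noisy sentence for one seed fact.
good = derive_pattern("Marie Curie was born in Warsaw, Poland.", "Marie Curie", "Warsaw")
noisy = derive_pattern("Marie Curie visited Warsaw, then Paris.", "Marie Curie", "Warsaw")

good_precision = pseudo_test(good, known_facts)
noisy_precision = pseudo_test(noisy, known_facts)
kept = [p for p, score in [(good, good_precision), (noisy, noisy_precision)] if score >= 0.5]
```

In this toy setting the pattern derived from the "was born in" sentence recovers the known values, while the pattern derived from the "visited" sentence fires on a held-out sentence and extracts a wrong value, so a precision threshold (0.5 here, an arbitrary choice) filters it out. Because the functions rely on substring search alone, the same code applies unchanged to text in languages without whitespace word boundaries, such as Chinese.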
