Information Extraction from Wikipedia Using Pattern Learning

In this paper we present solutions for the crucial task of extracting structured information from massive free-text resources such as Wikipedia, in order to populate semantic databases that serve emerging Semantic Web technologies. We demonstrate two approaches: a verb frame-based approach that relies on deep natural language processing and extraction patterns crafted by human knowledge experts, and machine learning methods that use only shallow linguistic processing. We also propose a method for learning verb frame-based extraction patterns automatically from labeled data, and we show that such training data can be produced with minimal human effort by exploiting existing semantic resources and the special characteristics of Wikipedia. This setting also admits custom solutions for named entity recognition. Finally, we evaluate and compare the different approaches on several relations.
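To make the notion of an extraction pattern concrete, the following minimal sketch shows a hand-written surface pattern for a single hypothetical "born in" relation applied to Wikipedia-style sentences. The pattern, relation, and example sentences are illustrative assumptions, not material from the paper; the approaches described above replace such hand-crafted rules with verb frame-based patterns learned from automatically labeled text.

```python
# Illustrative sketch only: a toy surface-pattern extractor for one relation
# (person "born in" place). The regex below stands in for the kind of rule a
# knowledge expert might author; the paper's contribution is to learn such
# patterns automatically rather than write them by hand.
import re

BORN_IN = re.compile(
    r"(?P<person>[A-Z][\w.]*(?: [A-Z][\w.]*)+)"   # capitalized multi-word name
    r" was born (?:on [^,]+, )?in "                # optional date, then location cue
    r"(?P<place>[A-Z]\w*(?: [A-Z]\w*)*)"           # capitalized place name
)

def extract_birthplaces(sentences):
    """Return (person, place) pairs matched by the surface pattern."""
    pairs = []
    for sentence in sentences:
        match = BORN_IN.search(sentence)
        if match:
            pairs.append((match.group("person"), match.group("place")))
    return pairs

if __name__ == "__main__":
    # Hypothetical Wikipedia-style sentences for demonstration.
    demo = [
        "Karl Pearson was born on 27 March 1857, in Islington.",
        "Ada Lovelace was born in London and studied mathematics.",
    ]
    print(extract_birthplaces(demo))
    # [('Karl Pearson', 'Islington'), ('Ada Lovelace', 'London')]
```

Such lexical rules are brittle; a verb frame-based pattern instead anchors the extraction on the predicate ("bear" in passive voice) and its syntactic arguments, which is what the learning method sketched in the abstract is intended to produce.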
