Building Distant Supervised Relation Extractors

A well-known drawback in building machine learning semantic relation detectors for natural language is the lack of a large number of qualified training instances for the target relations in multiple languages. Even when good results are achieved, the datasets used by the state-of-the-art approaches are rarely published. In order to address these problems, this work presents an automatic approach to build multilingual semantic relation detectors through distant supervision combining two of the largest resources of structured and unstructured content available on the Web, DBpedia and Wikipedia. We map the DBpedia ontology back to the Wikipedia text to extract more than 100.000 training instances for more than 90 DBpedia relations for English and Portuguese languages without human intervention. First, we mine the Wikipedia articles to find candidate instances for relations described in the DBpedia ontology. Second, we preprocess and normalize the data filtering out irrelevant instances. Finally, we use the normalized data to construct regularized logistic regression detectors that achieve more than 80% of F-Measure for both English and Portuguese languages. In this paper, we also compare the impact of different types of features on the accuracy of the trained detector, demonstrating significant performance improvements when combining lexical, syntactic and semantic features. Both the datasets and the code used in this research are available online.

[1]  Daniel Jurafsky,et al.  Distant supervision for relation extraction without labeled data , 2009, ACL.

[2]  Gerhard Weikum,et al.  YAGO: A Large Ontology from Wikipedia and WordNet , 2008, J. Web Semant..

[3]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[4]  D. Gerber,et al.  Bootstrapping the Linked Data Web , 2011 .

[5]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[6]  Pablo Gamallo,et al.  Dependency-Based Text Compression for Semantic Relation Extraction , 2011 .

[7]  Eraldo Rezende Fernandes,et al.  Latent Structure Perceptron with Feature Induction for Unrestricted Coreference Resolution , 2012, EMNLP-CoNLL Shared Task.

[8]  Takahiro Hara,et al.  Wikipedia Link Structure and Text Mining for Semantic Relation Extraction , 2008, SemSearch.

[9]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997 .

[10]  Mitsuru Ishizuka,et al.  Relation Extraction from Wikipedia Using Subtree Mining , 2007, AAAI.

[11]  Razvan C. Bunescu,et al.  A Shortest Path Dependency Kernel for Relation Extraction , 2005, HLT.

[12]  Jans Aasman Utilizing Federated Knowledge in Semantic Web Applications , 2008, 2008 IEEE International Conference on Semantic Computing.

[13]  Philipp Cimiano,et al.  Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction , 2007, PKDD.

[14]  Daniel S. Weld,et al.  Autonomously semantifying wikipedia , 2007, CIKM '07.

[15]  Nanda Kambhatla,et al.  Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Information Extraction , 2004, ACL.

[16]  Oren Etzioni,et al.  Strategies for lifelong knowledge extraction from the web , 2007, K-CAP '07.

[17]  Christian Bizer,et al.  DBpedia spotlight: shedding light on the web of documents , 2011, I-Semantics '11.

[18]  J. Giles Internet encyclopaedias go head to head , 2005, Nature.

[19]  周国栋,et al.  Kernel-Based Semantic Relation Detection and Classification via Enriched Parse Tree Structure , 2011 .

[20]  Naoaki Okazaki,et al.  Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web , 2009, ACL.

[21]  Luis Gravano,et al.  Snowball: extracting relations from large plain-text collections , 2000, DL '00.

[22]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[23]  Gerhard Weikum,et al.  LEILA: Learning to Extract Information by Linguistic Analysis , 2006, OntologyLearning@COLING/ACL.

[24]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[25]  Chang Wang,et al.  Relation Extraction with Relation Topics , 2011, EMNLP.

[26]  Sergey Brin,et al.  Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.