Datasets for generic relation extraction

A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format, such as a relational database or RDF triplestore, that can be used more effectively for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic relation extraction systems in different domains. However, comparative evaluation is impeded because these corpora use different markup formats and different notions of what constitutes a relation. We describe the preparation of corpora for comparative evaluation of relation extraction across domains, based on the publicly available ACE 2004, ACE 2005 and BioInfer data sets. We present a common document type that uses token standoff and includes detailed linguistic markup while maintaining all information in the original annotation. The subsequent reannotation process normalises the ACE and BioInfer data so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. For the ACE data, we describe an automatic process that converts many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, in the membership relation described in 'Amidu Berry, one half of PBS', the first entity is mapped from 'one' to 'Amidu Berry'. Moreover, we describe a comparably reannotated version of the BioInfer corpus that flattens nested relations, maps part-whole to part-part relations and maps n-ary to binary relations. Finally, we summarise experiments that compare approaches to generic relation extraction, a knowledge discovery task that uses minimally supervised techniques to achieve maximally portable extractors. These experiments illustrate the utility of the corpora.
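The ideas of token standoff and n-ary to binary mapping can be illustrated with a minimal sketch. This is not the document type or the mapping procedure used for the corpora. The class names, the token-offset convention (start inclusive, end exclusive), the relation labels and the pairwise expansion rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entity:
    """Hypothetical standoff entity: points at tokens instead of wrapping text."""
    id: str
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)


@dataclass(frozen=True)
class Relation:
    """Hypothetical relation over entity ids; may have more than two arguments."""
    type: str
    args: tuple


def to_binary(relation):
    """Expand a relation into one binary relation per unordered argument pair.

    Assumption: every pair of arguments is kept. A real mapping may be more
    selective about which pairs it retains.
    """
    return [Relation(relation.type, pair) for pair in combinations(relation.args, 2)]


# Token standoff: the token list is stored once, annotations refer to offsets.
tokens = "Amidu Berry , one half of PBS".split()
entities = {
    "e1": Entity("e1", 0, 2),  # tokens 0-1: "Amidu Berry"
    "e2": Entity("e2", 6, 7),  # token 6: "PBS"
}


def span_text(entity):
    """Recover the surface string of an entity from its token offsets."""
    return " ".join(tokens[entity.start:entity.end])


# A binary relation is left unchanged by the expansion.
membership = Relation("MEMBER", ("e1", "e2"))

# A ternary relation (hypothetical ids and label) becomes three binary relations.
ternary = Relation("BIND", ("p1", "p2", "p3"))
binary_relations = to_binary(ternary)
```

Because the annotations only reference token indices, several annotation layers can share the same token sequence without altering the text itself.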
