Challenges for automatically extracting molecular interactions from full-text articles

BackgroundThe increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles.ResultsWe manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved.We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set.ConclusionWe introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

[1]  Alexander A. Morgan,et al.  Background and overview for KDD Cup 2002 task 1: information extraction from biomedical articles , 2002, SKDD.

[2]  Osman Ugur Sezerman,et al.  Application of Automatic Mutation-gene Pair Extraction to Diseases , 2007, J. Bioinform. Comput. Biol..

[3]  Naoaki Okazaki,et al.  A Term Recognition Approach to Acronym Recognition , 2006, ACL.

[4]  Ronen Feldman,et al.  Rule-based extraction of experimental evidence in the biomedical domain: the KDD Cup 2002 (task 1) , 2002, SKDD.

[5]  Miguel A. Andrade-Navarro,et al.  Information extraction from full text scientific articles: Where are the keywords? , 2003, BMC Bioinformatics.

[6]  Razvan C. Bunescu,et al.  Integrating Co-occurrence Statistics with Information Extraction for Robust Retrieval of Protein Interactions from Medline , 2006, BioNLP@NAACL-HLT.

[7]  Robert J. Gaizauskas,et al.  Event coreference for information extraction , 1997 .

[8]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Approach to Identifying Sentence Boundaries , 1997, ANLP.

[9]  K. Bretonnel Cohen,et al.  Rapid Pattern Development for Concept Recognition Systems: Application to Point mutations , 2007, J. Bioinform. Comput. Biol..

[10]  Eric Gaussier,et al.  Annotating a large corpus with anaphoric links , 2000 .

[11]  K. Hyland,et al.  Writing Without Conviction? Hedging in Science Research Articles , 1996 .

[12]  Charles L. A. Clarke,et al.  Exploiting redundancy in question answering , 2001, SIGIR '01.

[13]  János Csirik,et al.  The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes , 2008, BMC Bioinformatics.

[14]  Jari Björne,et al.  BioInfer: a corpus for information extraction in the biomedical domain , 2007, BMC Bioinformatics.

[15]  Toshihisa Takagi,et al.  Research Paper: ALICE: An Algorithm to Extract Abbreviations from MEDLINE , 2005, J. Am. Medical Informatics Assoc..

[16]  K. Kohn Molecular interaction map of the mammalian cell cycle control and DNA repair systems. , 1999, Molecular biology of the cell.

[17]  Edgar Meij,et al.  Bootstrapping Language Associated with Biomedical Entities The AID Group at TREC Genomics 2007 , 2007 .

[18]  Martijn J. Schuemie,et al.  Distribution of information in biomedical abstracts and full-text publications , 2004, Bioinform..

[19]  Jun'ichi Tsujii,et al.  Corpus annotation for mining biomedical events from literature , 2008, BMC Bioinformatics.

[20]  Robert E. Mercer,et al.  A Design Methodology for a Biomedical Literature Indexing Tool Using the Rhetoric of Science , 2004, HLT-NAACL 2004.

[21]  Padmini Srinivasan,et al.  The Language of Bioscience: Facts, Speculations, and Statements In Between , 2004, HLT-NAACL 2004.

[22]  Nigel Collier,et al.  Synonym set extraction from the biomedical literature by lexical pattern discovery , 2007, BMC Bioinformatics.

[23]  Zhiyong Lu,et al.  OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression , 2008, BMC Bioinformatics.

[24]  Jari Björne,et al.  Extracting Complex Biological Events with Rich Graph-Based Feature Sets , 2009, BioNLP@HLT-NAACL.

[25]  Jun'ichi Tsujii,et al.  Syntax Annotation for the GENIA Corpus , 2005, IJCNLP.

[26]  Padmini Srinivasan,et al.  Gene Terms and English Words: An Ambiguous Mix , .

[27]  Jun'ichi Tsujii,et al.  An Intelligent Search Engine and GUI-based Efficient MEDLINE Search Tool Based on Deep Syntactic Parsing , 2006, ACL.

[28]  Carol Friedman,et al.  PhenoGO: Assigning Phenotypic Context to Gene Ontology Annotations with Natural Language Processing , 2005, Pacific Symposium on Biocomputing.

[29]  Edgar Meij,et al.  Bootstrapping Language Associated with Biomedical Entities , 2007, TREC.

[30]  Sampo Pyysalo,et al.  Overview of BioNLP’09 Shared Task on Event Extraction , 2009, BioNLP@HLT-NAACL.

[31]  Janet Holmes,et al.  Doubt and Certainty in ESL Textbooks , 1988 .

[32]  Marti A. Hearst,et al.  TREC 2007 Genomics Track Overview , 2007, TREC.

[33]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[34]  Bonnie L. Webber,et al.  Classification from Full Text: A Comparison of Canonical Sections of Scientific Papers , 2004, NLPBA/BioNLP.

[35]  Carol Friedman,et al.  Automatic extraction of gene and protein synonyms from MEDLINE and journal articles , 2002, AMIA.

[36]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[37]  Lorraine K. Tanabe,et al.  GENETAG: a tagged corpus for gene/protein named entity recognition , 2005, BMC Bioinformatics.

[38]  Ted Briscoe,et al.  Bootstrapping the Recognition and Anaphoric Linking of Named Entities in Drosophila Articles , 2006, Pacific Symposium on Biocomputing.