Paper information - Challenges for automatically extracting molecular interactions from full-text articles

Background: The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles.

Results: We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set.

Conclusion: We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

James R. Curran | Tara McIntosh
