Challenges for extracting biomedical knowledge from full text

At present, most biomedical information retrieval and extraction tools process abstracts rather than full-text articles. The increasing availability of full text will allow more knowledge to be extracted with greater reliability. To investigate the challenges of full-text processing, we manually annotated a corpus of articles cited in a Molecular Interaction Map (Kohn, 1999). Our analysis demonstrates the necessity of full-text processing, identifies the article sections where interactions are most commonly stated, and quantifies both the amount of external knowledge required and the proportion of interactions requiring multiple or deeper inference steps. It also identifies the range of NLP capabilities required, including synonym identification, coreference resolution, and the handling of negated expressions. These findings provide guidance for researchers engineering biomedical text processing systems.

[1]  Ronen Feldman,et al.  Rule-based extraction of experimental evidence in the biomedical domain , 2002 .

[2]  Carol Friedman,et al.  PhenoGO: Assigning Phenotypic Context to Gene Ontology Annotations with Natural Language Processing , 2005, Pacific Symposium on Biocomputing.

[3]  Charles L. A. Clarke,et al.  Exploiting redundancy in question answering , 2001, SIGIR '01.

[4]  C. Nédellec,et al.  Annotation Guidelines for Machine Learning-Based Named Entity Recognition in Microbiology , 2006 .

[5]  Padmini Srinivasan,et al.  Gene Terms and English Words: An Ambiguous Mix , .

[6]  Robert J. Gaizauskas,et al.  Event coreference for information extraction , 1997 .

[7]  Jun'ichi Tsujii,et al.  An Intelligent Search Engine and GUI-based Efficient MEDLINE Search Tool Based on Deep Syntactic Parsing , 2006, ACL.

[8]  Naoaki Okazaki,et al.  A Term Recognition Approach to Acronym Recognition , 2006, ACL.

[9]  Bonnie L. Webber,et al.  Classification from Full Text: A Comparison of Canonical Sections of Scientific Papers , 2004, NLPBA/BioNLP.

[10]  Ronen Feldman,et al.  Rule-based extraction of experimental evidence in the biomedical domain: the KDD Cup 2002 (task 1) , 2002, SKDD.

[11]  Martijn J. Schuemie,et al.  Distribution of information in biomedical abstracts and full-text publications , 2004, Bioinform..

[12]  Limsoon Wong,et al.  Accomplishments and challenges in literature data mining for biology , 2002, Bioinform..

[13]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[14]  Ted Briscoe,et al.  Bootstrapping the Recognition and Anaphoric Linking of Named Entities in Drosophila Articles , 2006, Pacific Symposium on Biocomputing.

[15]  K. Kohn.  Molecular interaction map of the mammalian cell cycle control and DNA repair systems , 1999, Molecular biology of the cell.

[16]  Miguel A. Andrade-Navarro,et al.  Information extraction from full text scientific articles: Where are the keywords? , 2003, BMC Bioinformatics.

[17]  Carol Friedman,et al.  Automatic extraction of gene and protein synonyms from MEDLINE and journal articles , 2002, AMIA.

[18]  Toshihisa Takagi,et al.  ALICE: An Algorithm to Extract Abbreviations from MEDLINE , 2005, J. Am. Medical Informatics Assoc..