Identifying Interaction Sentences from Biological Literature Using Automatically Extracted Patterns

A central task in information retrieval is identifying sentences that describe significant relationships between key concepts. In this work, we propose a novel approach that automatically extracts patterns for sentences describing interactions between concepts of molecular biology. A pattern is defined here as a sequence of specialized Part-of-Speech (POS) tags that captures the structure of key sentences in the scientific literature. Each candidate sentence in the classification task is encoded as a POS array and then aligned against a collection of pre-extracted patterns. The quality of each alignment is expressed as a pairwise alignment score. The most innovative component of this work is the use of a Genetic Algorithm (GA) to maximize the classification performance of the alignment scoring scheme. The system achieves an F-score of 0.834 in identifying sentences that describe interactions between biological entities. This performance depends mainly on the quality of preprocessing steps such as term identification and POS tagging.
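
The alignment step described above can be illustrated with a minimal sketch. It assumes a standard global (Needleman-Wunsch style) alignment over tag sequences; the abstract does not specify the actual scoring scheme, so the function names, the tag names (such as `PROTEIN`), the weights `match`, `mismatch`, and `gap`, and the decision threshold are all hypothetical. The weights and threshold are the kind of parameters a GA could tune against classification performance.

```python
from typing import Iterable, Sequence


def align_score(sentence_tags: Sequence[str], pattern_tags: Sequence[str],
                match: float = 2.0, mismatch: float = -1.0,
                gap: float = -1.0) -> float:
    """Global alignment score between two POS tag sequences.

    Assumed scoring: a flat reward for equal tags, a flat penalty for
    unequal tags, and a linear gap penalty (all hypothetical values).
    """
    m = len(pattern_tags)
    # prev[j]: best score aligning the sentence prefix seen so far
    # with the first j pattern tags.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(sentence_tags) + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            same = sentence_tags[i - 1] == pattern_tags[j - 1]
            curr[j] = max(
                prev[j - 1] + (match if same else mismatch),  # align two tags
                prev[j] + gap,       # sentence tag aligned to a gap
                curr[j - 1] + gap,   # pattern tag aligned to a gap
            )
        prev = curr
    return prev[m]


def best_pattern_score(sentence_tags: Sequence[str],
                       patterns: Iterable[Sequence[str]],
                       **weights: float) -> float:
    """Highest alignment score of a sentence against any pattern."""
    return max(align_score(sentence_tags, p, **weights) for p in patterns)


def is_interaction(sentence_tags: Sequence[str],
                   patterns: Iterable[Sequence[str]],
                   threshold: float, **weights: float) -> bool:
    """Classify a sentence as an interaction sentence (assumed rule:
    best alignment score must reach a fixed threshold)."""
    return best_pattern_score(sentence_tags, patterns, **weights) >= threshold


if __name__ == "__main__":
    patterns = [["PROTEIN", "VBZ", "PROTEIN"]]
    sentence = ["PROTEIN", "RB", "VBZ", "PROTEIN"]
    print(best_pattern_score(sentence, patterns))
    print(is_interaction(sentence, patterns, threshold=4.0))
```

In this sketch a sentence is accepted when its best alignment against any pattern reaches the threshold; an extra tag in the sentence costs one gap penalty rather than ruling out the match, which is the practical benefit of alignment over exact pattern matching.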
