Sentence identification of biological interactions using PATRICIA tree generated patterns and genetic algorithm optimized parameters

An important task in information retrieval is to identify sentences that describe significant relationships between key concepts. In this work, we propose a novel approach to automatically extract patterns of sentences that describe interactions involving concepts of molecular biology. A pattern is defined here as a sequence of specialized Part-of-Speech (POS) tags that captures the structure of key sentences in the scientific literature. Each candidate sentence for the classification task is encoded as a POS array and then aligned to a collection of pre-extracted patterns. The quality of each alignment is expressed as a pairwise alignment score. The most innovative component of this work is the use of a genetic algorithm (GA) to optimize the parameters of the alignment scoring scheme so that classification performance is maximized. The system achieves an average F-score of 0.796 in identifying sentences that describe interactions between co-occurring biological concepts. This performance depends mostly on the quality of the preprocessing steps, such as term identification and POS tagging.
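The pipeline described above (encode a sentence as a POS array, align it to patterns, score the alignment, tune the scoring parameters with a GA) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the tag names, the Needleman-Wunsch-style global alignment, the normalization by pattern length, the decision threshold, and the GA operators (elitist selection, uniform crossover, Gaussian mutation) are all hypothetical choices.

```python
import random

def align_score(seq, pattern, match=2.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score of two tag sequences (assumed scoring scheme)."""
    # prev[j] = best score for aligning the processed prefix of seq with pattern[:j]
    prev = [j * gap for j in range(len(pattern) + 1)]
    for i in range(1, len(seq) + 1):
        cur = [i * gap] + [0.0] * len(pattern)
        for j in range(1, len(pattern) + 1):
            sub = match if seq[i - 1] == pattern[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]

def classify(seq, patterns, params):
    """Label a sentence as an interaction if its best normalized score passes a threshold."""
    match, mismatch, gap, threshold = params
    best = max(align_score(seq, p, match, mismatch, gap) / len(p) for p in patterns)
    return best >= threshold

def f_score(params, patterns, labeled):
    """F1 of the classifier on (tag sequence, label) pairs."""
    tp = fp = fn = 0
    for seq, label in labeled:
        pred = classify(seq, patterns, params)
        tp += pred and label
        fp += pred and not label
        fn += (not pred) and label
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evolve(patterns, labeled, generations=30, pop_size=20, seed=0):
    """Toy GA over [match, mismatch, gap, threshold]; operators are illustrative."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 3), rng.uniform(-3, 0), rng.uniform(-3, 0), rng.uniform(0, 3)]
           for _ in range(pop_size)]
    fitness = lambda p: f_score(p, patterns, labeled)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]                 # elitist selection keeps the top half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]   # uniform crossover
            child[rng.randrange(len(child))] += rng.gauss(0, 0.3)  # mutate one gene
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# Hypothetical specialized POS tags: PROT = protein term, VB_INT = interaction verb.
PATTERNS = [["PROT", "VB_INT", "PROT"], ["PROT", "VB_INT", "IN", "PROT"]]
LABELED = [
    (["DT", "PROT", "VB_INT", "DT", "PROT"], True),
    (["PROT", "VB_INT", "IN", "PROT", "NN"], True),
    (["PROT", "VB", "IN", "DT", "NN"], False),
    (["DT", "NN", "VB", "PROT"], False),
]

best_params = evolve(PATTERNS, LABELED)
print(best_params, f_score(best_params, PATTERNS, LABELED))
```

Because the top half of each generation is carried over unchanged, the best fitness in this sketch never decreases from one generation to the next. On realistic data the fitness would be computed on held-out folds rather than on the sentences used for tuning.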
