Prosima: Protein similarity algorithm

In this article, we present a novel algorithm for measuring the similarity of proteins based on their three-dimensional structure (protein tertiary structure). The PROSIMA algorithm uses suffix trees to discover common parts of the main chains of all proteins in the current RCSB Protein Data Bank (PDB). From these common parts we build a vector model and then apply classical information retrieval techniques to that model to measure the similarity between proteins, that is, all-to-all protein similarity. To calculate protein similarity we use the tf-idf term weighting scheme and the cosine similarity measure. The goal of this work is to use the whole current PDB database of known proteins (downloaded in June 2009), not just the selected subsets of this database that have been studied in other works. We chose the SCOP database to verify the precision of our algorithm because it is maintained primarily by humans. A further result of this work is the ability to determine the SCOP categories of proteins not included in the latest version of the SCOP database (v. 1.75) with nearly 100% precision.
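The vector-model step described above can be illustrated with a minimal sketch. It is not the PROSIMA implementation and rests on several assumptions: each main chain is taken to be already encoded as a string over a small alphabet (the encoding and the example chains are hypothetical), fixed-length substrings stand in for the common parts that the article discovers with suffix trees, and the tf-idf variant shown (term frequency normalized by the maximum count, logarithmic idf) is one common choice that may differ from the one used in the article.

```python
import math
from collections import Counter


def extract_terms(chain, k=3):
    """Return all length-k substrings of an encoded main chain.

    Assumption: fixed-length substrings replace the suffix-tree-derived
    common parts used in the article.
    """
    return [chain[i:i + k] for i in range(len(chain) - k + 1)]


def tfidf_vectors(chains, k=3):
    """Build a sparse tf-idf vector (dict term -> weight) for every chain."""
    counts_by_name = {
        name: Counter(extract_terms(seq, k)) for name, seq in chains.items()
    }
    n = len(chains)

    # Document frequency: in how many chains each term occurs.
    df = Counter()
    for counts in counts_by_name.values():
        df.update(counts.keys())

    vectors = {}
    for name, counts in counts_by_name.items():
        max_tf = max(counts.values()) if counts else 1
        vectors[name] = {
            term: (count / max_tf) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical encoded chains; the alphabet is illustrative only.
    chains = {
        "p1": "HHHEEELLL",
        "p2": "HHHEEELLL",
        "p3": "LLLLLLLLL",
        "p4": "EHEHEHEHE",
    }
    vectors = tfidf_vectors(chains)
    # All-to-all similarity over every pair of chains.
    for a in chains:
        for b in chains:
            print(a, b, round(cosine(vectors[a], vectors[b]), 3))
```

In this sketch a term that occurs in every chain receives weight zero, so only parts shared by some but not all proteins contribute to the similarity score.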
