Indexing protein structures using suffix trees.

Approaches for indexing proteins and fast and scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this chapter, we describe a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Calpha atoms and the angles between the normals of planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy up to 97.8 and 99.4% at the superfamily and class level according to the SCOP classification and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results outperform the best previous methods.

[1]  Ruth Nussinov,et al.  A method for simultaneous alignment of multiple protein structures , 2004, Proteins.

[2]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[3]  Jignesh M. Patel,et al.  OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences , 2003, VLDB.

[4]  S. Bryant,et al.  Threading a database of protein cores , 1995, Proteins.

[5]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[6]  Sung-Hou Kim,et al.  Local feature frequency profile: a method to measure structural similarity in proteins. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[7]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[8]  Ambuj K. Singh,et al.  ProGreSS: Simultaneous Searching of Protein Databases by Sequence and Structure , 2004, Pacific Symposium on Biocomputing.

[9]  Jan Griebsch,et al.  A Method for Fast Approximate Searching of Polypeptide Structures in the PDB , 2004, German Conference on Bioinformatics.

[10]  Ambuj K. Singh,et al.  Towards index-based similarity search for protein structure databases , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[11]  K. Mizuguchi,et al.  Comparison of spatial arrangements of secondary structural elements in proteins. , 1995, Protein engineering.

[12]  Yehezkel Lamdan,et al.  Geometric Hashing: A General And Efficient Model-based Recognition Scheme , 1988, [1988 Proceedings] Second International Conference on Computer Vision.

[13]  William R. Taylor,et al.  Structure Comparison and Structure Patterns , 2000, J. Comput. Biol..

[14]  P E Bourne,et al.  Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. , 1998, Protein engineering.

[15]  S. Salzberg,et al.  Fast algorithms for large-scale genome alignment and comparison. , 2002, Nucleic acids research.

[16]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[17]  W R Taylor,et al.  SSAP: sequential structure alignment program for protein structure comparison. , 1996, Methods in enzymology.

[18]  B. Rost Twilight zone of protein sequence alignments. , 1999, Protein engineering.

[19]  Chris Sander,et al.  3-D Lookup: Fast Protein Structure Database Searches at 90% Reliability , 1995, ISMB.

[20]  Kian-Lee Tan,et al.  An efficient index-based protein structure database searching method , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[21]  Ruth Nussinov,et al.  MultiProt - A Multiple Protein Structural Alignment Algorithm , 2002, WABI.

[22]  Ruth Nussinov,et al.  MUSTA - A General, Efficient, Automated Method for Multiple Structure Alignment and Detection of Common Motifs: Application to Proteins , 2001, J. Comput. Biol..

[23]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[24]  Ruth Nussinov,et al.  MASS: multiple structural alignment by secondary structures , 2003, ISMB.

[25]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[26]  Yuan-Fang Wang,et al.  CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[27]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[28]  Esko Ukkonen,et al.  On-line construction of suffix trees , 1995, Algorithmica.

[29]  C. Sander,et al.  Protein structure comparison by alignment of distance matrices. , 1993, Journal of molecular biology.

[30]  Malcolm P. Atkinson,et al.  Database indexing for large DNA and protein sequence collections , 2002, The VLDB Journal.

[31]  Xin Yuan,et al.  Non-sequential structure-based alignments reveal topology-independent core packing arrangements in proteins , 2005, Bioinform..