FASSM: Enhanced Function Association in Whole Genome Analysis Using Sequence and Structural Motifs

We present an algorithm to detect remote homology, which arises through circular permutation and discontinuous domains. It is also helpful in detecting small domain proteins that are characterized by few conserved residues. The input to the algorithm is a set of multiply aligned protein sequence profiles. This method, coded as FASSM, examines the sequence conservation and positions of protein family signatures or motifs for the annotation of protein sequences and to facilitate the analysis of their domains. The overall coverage of FASSM is 93% in comparison to other validation tools like HMM and IMPALA. The method is especially useful for difficult relationships such as discontinuous domains during whole-genome surveys and is demonstrated to perform accurate family associations at sequence identities as low as 15%.

[1]  A. Panchenko,et al.  A comparison of position‐specific score matrices based on sequence and structure alignments , 2002, Protein science : a publication of the Protein Society.

[2]  Amos Bairoch,et al.  The PROSITE database, its status in 2002 , 2002, Nucleic Acids Res..

[3]  A. Lesk,et al.  How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. , 1980, Journal of molecular biology.

[4]  M. Gerstein,et al.  Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins , 2001, Genome Research.

[5]  M. Sternberg,et al.  Enhanced genome annotation using structural profiles in the program 3D-PSSM. , 2000, Journal of molecular biology.

[6]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[7]  R. Durbin,et al.  Pfam: A comprehensive database of protein domain families based on seed alignments , 1997, Proteins.

[8]  J. Stephen Judd,et al.  Optimal stopping and effective machine complexity in learning , 1993, Proceedings of 1995 IEEE International Symposium on Information Theory.

[9]  B. Efron Bootstrap Methods: Another Look at the Jackknife , 1979 .

[10]  Gajendra P. S. Raghava,et al.  A neural network method for prediction of ?-turn types in proteins using evolutionary information , 2004, Bioinform..

[11]  Annabel E. Todd,et al.  Evolution of function in protein superfamilies, from a structural perspective. , 2001, Journal of molecular biology.

[12]  K. Chou,et al.  A key driving force in determination of protein structural classes. , 1999, Biochemical and biophysical research communications.

[13]  Michael G. Rossmann,et al.  Chemical and biological evolution of a nucleotide-binding protein , 1974, Nature.

[14]  Judith E. Dayhoff,et al.  Neural Network Architectures: An Introduction , 1989 .

[15]  Shashi B. Pandit,et al.  SUPFAM - a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes , 2002, Nucleic Acids Res..

[16]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[17]  S V Evans,et al.  SETOR: hardware-lighted three-dimensional solid model representations of macromolecules. , 1993, Journal of molecular graphics.

[18]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[19]  P E Bourne,et al.  The Protein Data Bank. , 2002, Nucleic acids research.

[20]  M. Gerstein,et al.  Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. , 2000, Journal of molecular biology.

[21]  Michael Perrone,et al.  Putting It All Together: Methods for Combining Neural Networks , 1993, NIPS.

[22]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[23]  Alejandro A. Schäffer,et al.  IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices , 1999, Bioinform..

[24]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[25]  J. Skolnick,et al.  How well is enzyme function conserved as a function of pairwise sequence identity? , 2003, Journal of molecular biology.

[26]  K C Chou,et al.  An analysis of protein folding type prediction by seed-propagated sampling and jackknife test , 1995, Journal of protein chemistry.

[27]  S. Henikoff,et al.  Embedding strategies for effective use of information from multiple sequence alignments , 1997, Protein science : a publication of the Protein Society.