Classification tree based protein structure distances for testing sequence-structure correlation

A methodology for testing the correlation between the sequence distances and structure distances of proteins is proposed. Structure distances were derived by applying a forward-growing classification tree algorithm to defined physico-chemical and geometrical properties of the structures. The structure distance for every pair of proteins was defined as the number of intermediate nodes between them in the tree. Sequence distances were derived using pairwise sequence alignment. The correlation between the structure distance matrix and the sequence distance matrix was then tested using a Monte Carlo permutation test. The results were compared with those obtained when the double dynamic programming structure alignment method (SSAP) was used to derive the structure distances. The methodology was applied to a data set of 74 proteins belonging to 14 families. The classification tree was able to identify the protein families (misclassification rate R=1.4%), and a 74x74 structure distance matrix was produced. For every pair of protein sequences a dissimilarity score was recorded, and a 74x74 sequence distance matrix was produced. The Monte Carlo permutation test yielded a correlation coefficient of r=0.403 (P<0.001). The SSAP method produced similar results. The proposed methodology may assist in assessing whether protein sequence distances can be predictors of protein structure distances.
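The permutation test on two distance matrices can be illustrated with a minimal sketch. It assumes a Mantel-style procedure: the Pearson correlation is computed over the upper triangles of the two matrices, and the rows and columns of one matrix are then jointly shuffled to build the null distribution. The abstract does not state the exact statistic or the number of permutations, so the function names (`mantel_test`, `pearson`, `upper_triangle`), the one-sided p-value, and the toy data are all illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def upper_triangle(m):
    """Return the entries above the diagonal of a square matrix as a flat list."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(d_a, d_b, n_perm=999, seed=0):
    """Mantel-style permutation test (assumed variant, one-sided).

    d_a and d_b are symmetric distance matrices over the same objects.
    Returns the observed correlation and the fraction of permutations
    (counting the observed arrangement) with a correlation at least as large.
    """
    n = len(d_a)
    rng = random.Random(seed)
    y = upper_triangle(d_b)
    r_obs = pearson(upper_triangle(d_a), y)
    idx = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        # Permute rows and columns of d_a together so that it stays a
        # valid distance matrix over relabelled objects.
        permuted = [[d_a[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), y) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): six objects on a line; the second matrix is a
# monotone transform of the first, so the two should correlate strongly.
coords = [0, 1, 2, 5, 6, 9]
d_struct = [[abs(a - b) for b in coords] for a in coords]
d_seq = [[(a - b) ** 2 for b in coords] for a in coords]

r, p = mantel_test(d_struct, d_seq, n_perm=199)
print(f"r = {r:.3f}, p = {p:.3f}")
```

For the data set described above, each matrix would be 74x74, giving 74·73/2 = 2701 protein pairs in the upper triangle. With this p-value definition the smallest attainable value is 1/(n_perm + 1), so reporting P<0.001 would need more than 999 permutations.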
