Accurate Prediction of Protein Structural Class

Because of the increasing gap between the data from sequencing and structural genomics, the accurate prediction of the structural class of a protein domain solely from the primary sequence has remained a challenging problem in structural biology. Traditional sequence-based predictors generally select several sequence features and then feed them directly into a classification program to identify the structural class. The current best sequence-based predictor achieved an overall accuracy of 74.1% when tested on a widely used, non-homologous benchmark dataset 25PDB. In the present work, we built a multiple linear regression (MLR) model to convert the 440-dimensional (440D) sequence feature vector extracted from the Position Specific Scoring Matrix (PSSM) of a protein domain to a 4-dimensinal (4D) structural feature vector, which could then be used to predict the four major structural classes. We performed 10-fold cross-validation and jackknife tests of the method on a large non-homologous dataset containing 8,244 domains distributed among the four major classes. The performance of our approach outperformed all of the existing sequence-based methods and had an overall accuracy of 83.1%, which is even higher than the results of those predicted secondary structure-based methods.

[1]  Xin Chen,et al.  Prediction of protein structural classes for low-homology sequences based on predicted secondary structure , 2010, BMC Bioinformatics.

[2]  K. Chou,et al.  Prediction and classification of domain structural classes , 1998, Proteins.

[3]  Lukasz A. Kurgan,et al.  SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences , 2008, BMC Bioinformatics.

[4]  Kuo-Chen Chou,et al.  Using supervised fuzzy clustering to predict protein structural classes. , 2005, Biochemical and biophysical research communications.

[5]  Lukasz A. Kurgan,et al.  Prediction of structural classes for protein sequences and domains - Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy , 2006, Pattern Recognit..

[6]  David C. Jones,et al.  CATH--a hierarchic classification of protein domain structures. , 1997, Structure.

[7]  Y Cai,et al.  Prediction of protein structural classes by neural network. , 2000, Biochimie.

[8]  K. Chou,et al.  A key driving force in determination of protein structural classes. , 1999, Biochemical and biophysical research communications.

[9]  K Nishikawa,et al.  The folding type of a protein is relevant to the amino acid composition. , 1986, Journal of biochemistry.

[10]  Angelo M. Facchiano,et al.  PreSSAPro: A software for the prediction of secondary structure by amino acid properties , 2007, Comput. Biol. Chem..

[11]  G Deléage,et al.  An algorithm for protein secondary structure prediction based on class prediction. , 1987, Protein engineering.

[12]  Zong Dai,et al.  Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis , 2009, Amino Acids.

[13]  Zhi-Ping Feng,et al.  Prediction of protein structural class by amino acid and polypeptide composition. , 2002, European journal of biochemistry.

[14]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[15]  Zheng Yuan,et al.  How good is prediction of protein structural class by the component‐coupled method? , 2000, Proteins.

[16]  Xiaoqi Zheng,et al.  Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles , 2011, Amino Acids.

[17]  Kuo-Chen Chou,et al.  Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes , 2008, J. Comput. Chem..

[18]  Angelo M Facchiano,et al.  Prediction of the protein structural class by specific peptide frequencies. , 2009, Biochimie.

[19]  Yu-Dong Cai,et al.  Support vector machines for prediction of protein domain structural class. , 2003, Journal of theoretical biology.

[20]  S H Kim,et al.  Predicting protein secondary structure content. A tandem neural network approach. , 1992, Journal of molecular biology.

[21]  Zu-Guo Yu,et al.  Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. , 2009 .

[22]  K. Chou A novel approach to predicting protein structural classes in a (20–1)‐D amino acid composition space , 1995, Proteins.

[23]  C. Zhang,et al.  Prediction of protein (domain) structural classes based on amino-acid index. , 1999, European journal of biochemistry.

[24]  K. Chou,et al.  Prediction of protein structural classes. , 1995, Critical reviews in biochemistry and molecular biology.

[25]  Lukasz A. Kurgan,et al.  Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences , 2009, BMC Bioinformatics.

[26]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[27]  R. Jernigan,et al.  Understanding the recognition of protein structural classes by amino acid composition , 1997, Proteins.

[28]  C. Anfinsen Principles that govern the folding of protein chains. , 1973, Science.

[29]  Scott Dick,et al.  Classifier ensembles for protein structural class prediction with varying homology. , 2006, Biochemical and biophysical research communications.

[30]  Cangzhi Jia,et al.  A high-accuracy protein structural class prediction algorithm using predicted secondary structural information. , 2010, Journal of theoretical biology.

[31]  Corrigendum to “Predicting protein structural class by functional domain composition” [Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009] , 2005 .

[32]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[33]  Yu-Dong Cai,et al.  Support Vector Machines for predicting protein structural class , 2001, BMC Bioinformatics.

[34]  Pierre Baldi,et al.  Assessing the accuracy of prediction algorithms for classification: an overview , 2000, Bioinform..

[35]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[36]  Yuan Yuan,et al.  Using Bagging classifier to predict protein domain structural class. , 2006, Journal of biomolecular structure & dynamics.

[37]  Lukasz Kurgan,et al.  Prediction of protein structural class for the twilight zone sequences. , 2007, Biochemical and biophysical research communications.

[38]  Kuo-Chen Chou,et al.  Using pseudo amino acid composition to predict protein structural classes: Approached with complexity measure factor , 2006, J. Comput. Chem..

[39]  Patrice Koehl,et al.  The ASTRAL compendium for protein structure and sequence analysis , 2000, Nucleic Acids Res..

[40]  Xiaoqi Zheng,et al.  Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. , 2010, Biochimie.

[41]  Xian-Ming Pan,et al.  Multiple linear regression for protein secondary structure prediction , 2001, Proteins.

[42]  Kuo-Chen Chou,et al.  Predicting protein structural class by functional domain composition. , 2004, Biochemical and biophysical research communications.