Prediction of protein structural classes for low-homology sequences based on predicted secondary structure

Background: Prediction of protein structural classes (α, β, α + β and α/β) from amino acid sequences is of great importance because it aids the study of protein function, regulation and interactions. Many methods have been developed for high-homology protein sequences, and their prediction accuracies can reach 90%. However, for low-homology sequences whose average pairwise sequence identity lies between 20% and 40%, these methods perform relatively poorly, with prediction accuracy often below 60%.

Results: We propose a new method to predict protein structural classes on the basis of features extracted from the predicted secondary structures of proteins rather than directly from their amino acid sequences. It first uses PSIPRED to predict the secondary structure of each protein sequence. Then, the chaos game representation is employed to represent the predicted secondary structure as two time series, from which we generate a comprehensive set of 24 features using recurrence quantification analysis, K-string based information entropy and segment-based analysis. The resulting feature vectors are finally fed into a simple yet powerful Fisher's discriminant algorithm for the prediction of protein structural classes. We tested the proposed method on three low-homology benchmark datasets and achieved overall prediction accuracies of 82.9%, 83.1% and 81.3%, respectively. Comparisons with ten existing methods showed that our method consistently performs better on all the tested datasets, with overall accuracy improvements ranging from 2.3% to 27.5%. A web server that implements the proposed method is freely available at http://www1.spms.ntu.edu.sg/~chenxin/RKS_PPSC/.

Conclusion: The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the predicted secondary structure sequences, which is capable of characterizing the sequence order information, the local interactions of the secondary structural elements, and the spatial arrangements of α helices and β strands. Thus, it is a valuable method for predicting protein structural classes, particularly for low-homology amino acid sequences.
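To make the feature-extraction idea concrete, the following is a minimal Python sketch of two of the steps named in the Results: turning a predicted secondary structure string into two time series with a chaos game walk, and computing a K-string information entropy. The abstract does not give the exact construction, so the details here are assumptions for illustration only: the three-state alphabet (H, E, C), the triangle vertex coordinates, the centroid starting point, and the use of Shannon entropy over overlapping K-strings. The function names `cgr_time_series` and `k_string_entropy` are hypothetical and are not taken from the authors' software.

```python
import math
from collections import Counter

# Assumed layout: one corner of an equilateral triangle per
# secondary structure state (helix H, strand E, coil C).
VERTICES = {
    "H": (0.0, 0.0),
    "E": (1.0, 0.0),
    "C": (0.5, math.sqrt(3) / 2),
}


def cgr_time_series(ss, start=None):
    """Chaos game walk over a secondary structure string.

    Each step moves halfway from the current point towards the
    vertex of the next state. Returns the x and y coordinates of
    the visited points as two separate series.
    """
    if start is None:
        # Assumed starting point: the centroid of the triangle.
        start = (0.5, math.sqrt(3) / 6)
    x, y = start
    xs, ys = [], []
    for state in ss:
        vx, vy = VERTICES[state]
        x, y = (x + vx) / 2, (y + vy) / 2
        xs.append(x)
        ys.append(y)
    return xs, ys


def k_string_entropy(ss, k):
    """Shannon entropy (in bits) of overlapping substrings of length k."""
    counts = Counter(ss[i:i + k] for i in range(len(ss) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    predicted = "CCHHHHHHCCEEEEECC"  # made-up example string
    xs, ys = cgr_time_series(predicted)
    print(len(xs), len(ys))
    print(k_string_entropy(predicted, 2))
```

In a pipeline like the one described, series such as `xs` and `ys` would be the input to recurrence quantification analysis, and entropies for several values of `k` would join the feature vector passed to the classifier; neither of those later steps is shown here.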
