A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation

MOTIVATION: Fold recognition is an important step in protein structure and function prediction. Traditional sequence comparison methods fail to identify reliable homologies when sequence identity is low. Taxonomic methods are an effective alternative, but their prediction accuracies are around 70%, which is still relatively low for practical use.

RESULTS: This study presents a simple and powerful method for taxonomic fold recognition that combines a support vector machine (SVM) with autocross-covariance (ACC) transformation. Evolutionary information, represented as position-specific score matrices, is converted into a series of fixed-length vectors by ACC transformation, and these vectors are then input to an SVM classifier for fold recognition. This scheme effectively captures the sequence-order effect. Experiments are performed on the widely used D-B dataset and on the corresponding extended dataset. The proposed method, called ACCFold, achieves an overall accuracy of 70.1% on the D-B dataset, 2-14% higher than major existing taxonomic methods. On the extended dataset it achieves an overall accuracy of 87.6%, surpassing major existing taxonomic methods by 9-17%. The method also obtains an overall accuracy of 80.9% for 86 folds and 77.2% for 199 folds. These results demonstrate that ACCFold provides state-of-the-art performance for taxonomic fold recognition.

AVAILABILITY: The source code for ACC transformation is freely available at http://www.iipl.fudan.edu.cn/demo/accpkg.html.
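The abstract describes converting a variable-length position-specific score matrix into a fixed-length vector by ACC transformation but does not give the formula. The sketch below assumes the commonly used formulation, in which each feature is the covariance between one profile column and another (or the same) column shifted by a lag, averaged over the sequence. The function name `acc_transform`, the parameter `max_lag`, and the toy profiles are illustrative assumptions and are not taken from the authors' package.

```python
from typing import List


def acc_transform(pssm: List[List[float]], max_lag: int) -> List[float]:
    """Map an L x N profile (rows = sequence positions, columns = amino acid
    scores) to a vector of N * N * max_lag covariance terms.

    Assumed formulation, not confirmed by the abstract:
    feature(a, b, lag) = mean over j of
        (S[j][a] - mean_a) * (S[j + lag][b] - mean_b)
    where a == b gives the auto-covariance terms and a != b the
    cross-covariance terms.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must be at least 1 and less than the sequence length")

    # Centre every column on its mean over the whole sequence.
    means = [sum(row[c] for row in pssm) / length for c in range(n_cols)]
    centered = [[row[c] - means[c] for c in range(n_cols)] for row in pssm]

    features: List[float] = []
    for lag in range(1, max_lag + 1):
        span = length - lag  # number of position pairs separated by `lag`
        for a in range(n_cols):
            for b in range(n_cols):
                total = sum(
                    centered[j][a] * centered[j + lag][b] for j in range(span)
                )
                features.append(total / span)
    return features


# Toy profiles with 3 columns instead of 20, and different lengths.
short_profile = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
long_profile = short_profile + [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

short_vec = acc_transform(short_profile, max_lag=2)
long_vec = acc_transform(long_profile, max_lag=2)
print(len(short_vec), len(long_vec))
```

Both toy profiles yield vectors of the same length (3 * 3 * 2 = 18) despite having different numbers of rows, which is the property that lets sequences of any length be passed to a fixed-input classifier such as an SVM. With a real 20-column profile the same sketch would give 400 features per lag.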
