Remote protein homology detection and fold recognition using two-layer support vector machine classifiers

Remote protein homology detection and fold recognition refer to the detection of structural homology between proteins whose sequences show little or no similarity. To predict protein structural classes from primary sequence information, homology-based methods have been developed, which can be divided into three types: discriminative classifiers, generative models for protein families, and pairwise sequence comparisons. Support Vector Machines (SVM) and Neural Networks (NN) are two popular discriminative methods. Recent studies have shown that SVMs train faster and are more accurate and efficient than NNs. We present a comprehensive method, BioSVM-2L, based on two layers of classifiers. The first layer detects homology down to the superfamily and family levels of the SCOP hierarchy using optimized binary SVM classification rules. It uses a kernel function known as the Bio-kernel, which incorporates biological information into the classification process. The second layer uses a discriminative SVM algorithm with a string kernel to detect homology at the protein fold level of the SCOP hierarchy. The results were evaluated using mean ROC and mean MRFP, and their significance was tested with a pairwise t-test. Experimental results show that our approach significantly improves the performance of remote protein homology detection and fold recognition on all three versions of the SCOP dataset (1.53, 1.67 and 1.73). Compared with the results of well-known methods, we achieved improvements in mean ROC of 4.19% on SCOP 1.53, 4.75% on SCOP 1.67 and 4.03% on SCOP 1.73. The combination of the first and second layers of BioSVM-2L performs well in remote homology detection and fold recognition across all three versions of the dataset.
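
To make the notions of a string kernel and a ROC score concrete, the following is a minimal Python sketch. It is not the BioSVM-2L implementation: the abstract does not specify which string kernel the second layer uses, so the k-spectrum kernel is assumed here purely for illustration, and the function names (spectrum_features, spectrum_kernel, roc_auc) and the default k=3 are hypothetical. The ROC score is computed per protein family; averaging it over families would give a mean ROC.

```python
from collections import Counter


def spectrum_features(seq, k=3):
    """Count every length-k substring (k-mer) of a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """k-spectrum kernel: inner product of the two k-mer count vectors.

    Assumption: chosen as an illustrative string kernel, not necessarily
    the one used in the paper.
    """
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    # Counter returns 0 for k-mers that do not occur in b.
    return sum(count * fb[kmer] for kmer, count in fa.items())


def roc_auc(scores, labels):
    """Area under the ROC curve for one family.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    Labels are 1 for family members and 0 for non-members.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A kernel of this kind can be evaluated on every pair of training sequences to build the Gram matrix that an SVM solver accepts as a precomputed kernel; the classifier's decision values for held-out sequences would then serve as the scores passed to roc_auc.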
