Improving structure alignment-based prediction of SCOP families using Vorolign Kernels

MOTIVATION: Expert-curated databases grow slowly compared to experimental databases, so highly accurate automated processing pipelines are needed to make the most of the data until curation becomes available. We address this problem in the context of protein structures and their classification into structural and functional classes, more specifically, the structural classification of proteins (SCOP). Structural alignment methods like Vorolign already provide good classification results, but effectively operate in a 1-nearest-neighbor mode. Model-based (in contrast to instance-based) approaches have so far been shown to be of limited value because of the small classes that arise in such classification schemes.

RESULTS: In this article, we describe how kernels defined in terms of Vorolign scores can be used in SVM learning, and explore variants of combined instance-based and model-based learning, up to exclusively model-based learning. Our results suggest that kernels based on Vorolign scores are effective and that model-based learning can yield highly competitive classification results for the prediction of SCOP families.

AVAILABILITY: The code is made available at: http://wwwkramer.in.tum.de/research/applications/vorolign-kernel.
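To make the contrast between instance-based and model-based learning under RESULTS concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small matrix of made-up pairwise scores (not real Vorolign output), builds a positive semidefinite kernel by treating each row of scores as a feature vector (one common way to handle similarity data; the abstract does not say which construction the paper uses), and trains a kernel perceptron as a simple stand-in for an SVM. All names and values are hypothetical.

```python
# Hypothetical pairwise alignment scores for six structures (toy values).
scores = [
    [10.0, 8.0, 7.5, 1.0, 1.5, 0.5],
    [8.0, 10.0, 7.0, 1.2, 0.8, 1.0],
    [7.5, 7.0, 10.0, 0.9, 1.1, 1.3],
    [1.0, 1.2, 0.9, 10.0, 8.5, 7.8],
    [1.5, 0.8, 1.1, 8.5, 10.0, 8.1],
    [0.5, 1.0, 1.3, 7.8, 8.1, 10.0],
]
families = ["famA", "famA", "famA", "famB", "famB", "famB"]


def nearest_neighbor_label(query_scores, train_labels, skip=None):
    """Instance-based: return the label of the best-scoring structure."""
    candidates = [j for j in range(len(train_labels)) if j != skip]
    best = max(candidates, key=lambda j: query_scores[j])
    return train_labels[best]


def empirical_kernel(s):
    """K = S S^T: dot products of score rows, PSD by construction."""
    n = len(s)
    return [
        [sum(s[i][k] * s[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def train_kernel_perceptron(kernel, y, epochs=20):
    """Model-based stand-in for an SVM; y holds +1/-1 labels."""
    n = len(y)
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            f = sum(alpha[j] * y[j] * kernel[j][i] for j in range(n))
            if y[i] * f <= 0:  # misclassified (or undecided): update
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def decision_value(alpha, y, kernel_column):
    """Signed score of one structure given its kernel values to the training set."""
    return sum(a * yj * k for a, yj, k in zip(alpha, y, kernel_column))


# Leave-one-out 1-nearest-neighbor predictions on the toy scores.
loo_predictions = [
    nearest_neighbor_label(scores[i], families, skip=i)
    for i in range(len(families))
]

# Binary model: famA (+1) versus famB (-1).
y = [1 if fam == "famA" else -1 for fam in families]
K = empirical_kernel(scores)
alpha = train_kernel_perceptron(K, y)
train_predictions = [
    1 if decision_value(alpha, y, K[i]) > 0 else -1 for i in range(len(y))
]
```

In practice the kernel matrix would be passed to an SVM implementation that accepts precomputed kernels, and a multi-class scheme would be needed for many SCOP families; both are outside this sketch.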
