HHsvm: fast and accurate classification of profile-profile matches identified by HHsearch

MOTIVATION: Recently developed profile-profile methods rival structural comparisons in their ability to detect homology between distantly related proteins. Despite this progress, many genuine relationships between protein families cannot be recognized, because comparisons of their profiles yield scores that are statistically insignificant.

RESULTS: Using known evolutionary relationships among protein superfamilies in the SCOP database, support vector machines were trained on four sets of discriminatory features derived from the output of HHsearch. Validation showed that automatic classification of all profile-profile matches was superior to fixed threshold-based annotation in terms of sensitivity and specificity. The effectiveness of this approach was demonstrated by annotating several domains of unknown function from the Pfam database.

AVAILABILITY: Programs and scripts implementing the methods described in this manuscript are freely available from http://hhsvm.dlakiclab.org/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
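
The abstract evaluates annotation in terms of sensitivity and specificity, with fixed threshold-based annotation as the baseline. As a minimal sketch of how that baseline can be scored, the Python below applies a single cutoff to match scores and computes both measures against known labels. The function names, the cutoff values, and the toy scores and labels are all hypothetical and chosen only for illustration; they are not taken from HHsearch output or from the paper's benchmark.

```python
from typing import List, Sequence, Tuple


def threshold_calls(scores: Sequence[float], cutoff: float) -> List[bool]:
    """Fixed-threshold annotation: call a match homologous if score >= cutoff."""
    return [s >= cutoff for s in scores]


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical toy data: six profile-profile matches with made-up scores.
# True marks a genuine relationship; two true pairs score below a false one.
toy_scores = [98.0, 85.0, 40.0, 15.0, 60.0, 10.0]
toy_truth = [True, True, True, True, False, False]

# Two illustrative cutoffs (assumed values, not recommendations).
lenient = sensitivity_specificity(toy_truth, threshold_calls(toy_scores, 50.0))
strict = sensitivity_specificity(toy_truth, threshold_calls(toy_scores, 90.0))
```

In this toy set, raising the cutoff gains specificity only by losing sensitivity, which is the trade-off a single score threshold imposes. The predictions of any trained classifier can be passed to `sensitivity_specificity` in place of `threshold_calls` output to make the same comparison.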
