Protein homology detection with sparse models

Establishing structural or functional relationships between sequences, for instance to infer the structural class of an unannotated protein, is a key task in biological analysis. Protein sequences undergo complex transformations such as mutation, insertion, and deletion during evolution and typically share low sequence similarity at the superfamily level, which makes remote homology detection from primary sequence alone very challenging. Previous studies suggest that knowledge of only a subset of critical positions, together with the preferred symbols at those positions, is sufficient for remote homology detection. Building on this observation, we present a series of works, each enforcing a different notion of sparsity, to recover such critical positions. We start with a generative model, the sparse profile hidden Markov model. This generative approach recovers some critical patterns and motivates the need for discriminative learning. In our second study, we present a discriminative approach to recovering the critical positions and their preferred symbols. In our third study, we address the problem of having very few positive training examples alongside a large number of negative training examples, which is typical of many remote homology detection tasks. This problem motivates the need for semi-supervised learning. However, although large uncurated sequence databases contain abundant useful information, they also contain a lot of noise, which may compromise the quality of the classifiers. We therefore present a systematic and biologically motivated framework for semi-supervised learning with large uncurated sequence databases. Combined with a very fast string kernel, our method not only achieves rapid and accurate remote homology detection with state-of-the-art performance, but also recovers some critical patterns conserved in superfamilies.
