Protein homology detection with biologically inspired features and interpretable statistical models

Computational classification of proteins using methods such as string kernels and Fisher-SVM has been highly successful. However, the resulting models do not offer an immediate interpretation of the underlying biological mechanisms. In this work, we propose a biologically motivated feature set combined with a sparse classifier for protein superfamily detection. The classifier relies on a small subset of positions and residues in protein sequences. We show that the performance of our models is comparable to that of state-of-the-art methods on a benchmark dataset. The sparse set of critical features discovered by the models is consistent with confirmed biological findings.
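To make the idea of a sparse classifier over positions and residues concrete, the following is a minimal sketch and not the method of the paper. It assumes that sequences are already aligned to a common length, that each (position, residue) pair becomes one binary feature, and that sparsity comes from an L1 penalty on a logistic regression fitted by proximal gradient descent. The abstract states none of these choices. The function names, the parameter values, and the synthetic toy data are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_positions(aligned_seqs):
    """Encode equal-length aligned sequences as binary (position, residue) indicators."""
    length = len(aligned_seqs[0])
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    X = np.zeros((len(aligned_seqs), length * len(AMINO_ACIDS)))
    for row, seq in enumerate(aligned_seqs):
        for pos, aa in enumerate(seq):
            if aa in index:  # gaps or unknown symbols leave their columns at zero
                X[row, pos * len(AMINO_ACIDS) + index[aa]] = 1.0
    return X


def fit_l1_logistic(X, y, lam=0.05, step=0.5, n_iter=2000):
    """L1-penalised logistic regression via proximal gradient descent (assumed optimiser)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w = w - step * grad_w
        # Soft-thresholding is the proximal step of the L1 penalty; it zeroes small weights.
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
        b = b - step * grad_b  # the intercept is not penalised
    return w, b


def selected_features(w, tol=1e-8):
    """Return (position, residue, weight) for every nonzero weight, largest magnitude first."""
    feats = [(int(j) // len(AMINO_ACIDS), AMINO_ACIDS[int(j) % len(AMINO_ACIDS)], float(w[j]))
             for j in np.flatnonzero(np.abs(w) > tol)]
    return sorted(feats, key=lambda f: -abs(f[2]))


# Synthetic toy data (hypothetical): members carry C at position 2 and W at position 4.
rng = np.random.default_rng(0)
seqs, labels = [], []
for i in range(40):
    s = list(rng.choice(list(AMINO_ACIDS), size=6))
    if i % 2 == 0:
        s[2], s[4] = "C", "W"
        labels.append(1.0)
    else:
        if s[2] == "C":
            s[2] = "A"
        if s[4] == "W":
            s[4] = "A"
        labels.append(0.0)
    seqs.append("".join(s))

X = one_hot_positions(seqs)
y = np.array(labels)
w, b = fit_l1_logistic(X, y)
feats = selected_features(w)
accuracy = float(np.mean(((X @ w + b) > 0) == (y == 1.0)))
print(feats[:5], accuracy)
```

In a model of this kind, each nonzero weight names a specific residue at a specific position, so the fitted model can be compared directly with positions known from structural studies. This is the kind of interpretability the abstract contrasts with kernel-based models.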
