Enzyme function prediction with interpretable models.

Enzymes play central roles in metabolic pathways, and predicting the metabolic pathways of a newly sequenced genome usually starts with assigning genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, so the traditional approach based on sequence similarity often fails to identify the relevant enzymes, hindering efforts to map the metabolome of an organism. Here we study the direct relationship between basic protein properties and protein function. Our goal is to develop a new tool for functional prediction (e.g., prediction of the Enzyme Commission number) that can complement and support other techniques based on sequence or structure information. To define this mapping, we collected a set of 453 features and properties that characterize proteins and are believed to be related to their structural and functional aspects. We introduce a mixture model of stochastic decision trees to learn the potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and on the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
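To make the idea of a mixture of stochastic decision trees more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how randomness enters tree construction or how the trees are weighted, so two assumptions are made here: each split is sampled with probability proportional to its information gain, and the mixture averages the leaf class distributions with uniform weights unless other weights are supplied. All function names (`entropy`, `candidate_splits`, `grow_tree`, `tree_predict`, `mixture_predict`) are hypothetical, and the two-feature toy data are invented for illustration only.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_splits(rows, labels):
    """Return (feature, threshold, gain) for every split with positive information gain."""
    base, n, out = entropy(labels), len(labels), []
    for f in range(len(rows[0])):
        values = sorted({x[f] for x in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [y for x, y in zip(rows, labels) if x[f] <= thr]
            right = [y for x, y in zip(rows, labels) if x[f] > thr]
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if gain > 1e-12:
                out.append((f, thr, gain))
    return out


def grow_tree(rows, labels, rng, max_depth=3, depth=0):
    """Grow one stochastic tree (assumption: splits are sampled in proportion to gain)."""
    splits = candidate_splits(rows, labels) if depth < max_depth else []
    if not splits:
        # Leaf node: store the empirical class distribution of the samples that reach it.
        n = len(labels)
        return {"leaf": {c: k / n for c, k in Counter(labels).items()}}
    f, thr, _ = rng.choices(splits, weights=[g for _, _, g in splits], k=1)[0]
    left = [(x, y) for x, y in zip(rows, labels) if x[f] <= thr]
    right = [(x, y) for x, y in zip(rows, labels) if x[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": grow_tree([x for x, _ in left], [y for _, y in left], rng, max_depth, depth + 1),
        "right": grow_tree([x for x, _ in right], [y for _, y in right], rng, max_depth, depth + 1),
    }


def tree_predict(tree, x):
    """Follow the splits down to a leaf and return its class distribution."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]


def mixture_predict(trees, x, weights=None):
    """Weighted average of the per-tree class distributions (uniform weights by default)."""
    weights = weights or [1 / len(trees)] * len(trees)
    total = Counter()
    for w, tree in zip(weights, trees):
        for label, p in tree_predict(tree, x).items():
            total[label] += w * p
    return dict(total)


# Toy data: two made-up numeric features per protein and a top-level EC class label.
rows = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
labels = ["EC 3", "EC 3", "EC 2", "EC 2"]
rng = random.Random(0)
trees = [grow_tree(rows, labels, rng) for _ in range(5)]
print(mixture_predict(trees, [0.15, 3.0]))
```

Because each tree is a short sequence of threshold tests on named features, the tests along a path can be read off directly, which is the sense in which such trees can suggest a concise description of a protein family.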
