On the Importance of Comprehensible Classification Models for Protein Function Prediction

The literature on protein function prediction is currently dominated by work that aims to maximize predictive accuracy while neglecting the validation and interpretation of the discovered knowledge. Yet these steps matter: they can lead to new, biologically meaningful insights and hypotheses that advance biologists' understanding of protein function. This paper critically evaluates the accuracy-only approach and argues for a perspective that considers not only predictive accuracy but also the comprehensibility of the induced protein function prediction models. It makes two main contributions. First, it presents the case for discovering comprehensible protein function prediction models from data, discussing in detail the advantages of such models: they increase the biologist's confidence in the system's predictions, lead to new insights about the data and to the formulation of new biological hypotheses, and help detect errors in the data. Second, it critically reviews the pros and cons of several knowledge representations that can support the discovery of comprehensible protein function prediction models.
