Prediction of Metabolic Pathways Involvement in Prokaryotic UniProtKB Data by Association Rule Mining

The widening gap between known proteins and their functions has encouraged the development of methods to infer annotations automatically. Functional annotation of proteins encoded in newly sequenced genomes must meet two conflicting requirements: providing information that is as comprehensive as possible while avoiding erroneous functional assignments and over-predictions. This trade-off poses a major challenge for the design of intelligent systems for automatic protein annotation. In this work, we present a system that uses association rule mining to predict metabolic pathways in prokaryotes. The resulting knowledge takes the form of predictive models that explain the pathway involvement of UniProtKB entries. We evaluated the performance of our system using semantic similarity and cross-validation. It identified pathways with an F1-measure of 0.987 and an AUC of 0.99. We then applied our prediction models to 4.6 million UniProtKB/TrEMBL reference proteome entries of prokaryotes. As a result, 551,418 entries were covered, of which 371,265 lacked any previous pathway annotation.
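To make the idea of mining rules that link entry annotations to pathways concrete, the following is a minimal sketch, not the system described above. It assumes each entry is reduced to a set of attribute strings plus an optional pathway label, and it keeps a rule only if it meets a support and a confidence threshold. The attribute names, pathway names, thresholds, and the helper functions `mine_rules` and `predict` are all hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy, made-up entries: (set of annotation attributes, pathway label or None).
ENTRIES = [
    ({"sig:A", "taxon:Bacteria"}, "pathway:X"),
    ({"sig:A", "taxon:Bacteria"}, "pathway:X"),
    ({"sig:A", "taxon:Archaea"}, "pathway:X"),
    ({"sig:B", "taxon:Bacteria"}, "pathway:Y"),
    ({"sig:B", "taxon:Bacteria"}, "pathway:X"),
    ({"sig:C", "taxon:Archaea"}, None),
]


def mine_rules(entries, min_support=2, min_confidence=0.9, max_len=2):
    """Return rules (antecedent, pathway, support, confidence).

    support    = number of entries containing the antecedent AND the pathway
    confidence = support / number of entries containing the antecedent
    """
    antecedent_counts = Counter()
    rule_counts = Counter()
    for attrs, pathway in entries:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(attrs), k):
                antecedent_counts[combo] += 1
                if pathway is not None:
                    rule_counts[(combo, pathway)] += 1

    rules = []
    for (combo, pathway), support in rule_counts.items():
        confidence = support / antecedent_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, pathway, support, confidence))
    return sorted(rules)


def predict(attrs, rules):
    """Predict every pathway whose rule antecedent is contained in attrs."""
    return {pathway for combo, pathway, _, _ in rules if set(combo) <= attrs}


RULES = mine_rules(ENTRIES)
for combo, pathway, support, confidence in RULES:
    print(f"{' & '.join(combo)} -> {pathway} "
          f"(support={support}, confidence={confidence:.2f})")

# Apply the mined rules to an entry that has no pathway annotation yet.
print(predict({"sig:A", "taxon:Bacteria"}, RULES))
```

In this sketch, entries without a pathway still count toward the antecedent total, so they lower the confidence of any rule whose antecedent they match. That is one possible way to bias a rule set against over-prediction; it is an assumption of the example, not a statement about how the presented system computes confidence.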