Rule Mining Techniques to Predict Prokaryotic Metabolic Pathways.

It is becoming increasingly evident that computational methods are needed to identify and map pathways in new genomes. We introduce an automatic annotation system, ARBA4Path (Association Rule-Based Annotator for Pathways), that uses rule mining techniques to predict metabolic pathways across a wide range of prokaryotes. We demonstrate that specific combinations of protein domains (recorded in our rules) strongly determine the pathways in which proteins are involved, and thus provide information that lets us assign pathway membership to proteins of a given prokaryotic taxon very accurately (precision of 0.999 and recall of 0.966). Our system can be used to enhance the quality of automatically generated annotations as well as to annotate proteins of unknown function. The prediction models are represented as human-readable rules, and they can be used effectively to add missing pathway information to many proteins in the UniProtKB/TrEMBL database.
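To make the idea of domain-combination rules concrete, the following is a minimal sketch of association rule mining over protein annotations. It is not the ARBA4Path implementation: the toy protein records, the domain and pathway identifiers, the function names, the thresholds, and the choice to count unannotated proteins as counterexamples are all assumptions made for illustration. Each rule has the form "domain combination implies pathway" and is kept only if it reaches a minimum support and confidence; precision and recall are then computed by comparing rule predictions with the known labels.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (set of domain IDs, known pathway or None if unannotated).
PROTEINS = [
    ({"D1", "D2"}, "pathway_A"),
    ({"D1", "D2", "D3"}, "pathway_A"),
    ({"D1", "D2"}, "pathway_A"),
    ({"D1"}, None),
    ({"D3", "D4"}, "pathway_B"),
    ({"D3", "D4"}, "pathway_B"),
    ({"D3"}, None),
]


def mine_rules(proteins, min_support=2, min_confidence=0.9, max_size=2):
    """Return {(domain_combo, pathway): confidence} for rules passing both thresholds.

    Assumption: a protein without a pathway label counts against the
    confidence of every rule whose domain combination it contains.
    """
    antecedent_count = defaultdict(int)  # proteins containing the domain combo
    rule_count = defaultdict(int)        # proteins containing combo AND pathway
    for domains, pathway in proteins:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(domains), size):
                antecedent_count[combo] += 1
                if pathway is not None:
                    rule_count[(combo, pathway)] += 1

    rules = {}
    for (combo, pathway), support in rule_count.items():
        confidence = support / antecedent_count[combo]
        if support >= min_support and confidence >= min_confidence:
            rules[(combo, pathway)] = confidence
    return rules


def predict(rules, domains):
    """Pathways predicted for a protein: every rule whose domains all occur."""
    return {pathway for (combo, pathway) in rules if set(combo) <= domains}


def precision_recall(rules, proteins):
    """Compare predicted pathways against the known labels."""
    tp = fp = fn = 0
    for domains, pathway in proteins:
        predicted = predict(rules, domains)
        truth = {pathway} if pathway is not None else set()
        tp += len(predicted & truth)
        fp += len(predicted - truth)
        fn += len(truth - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


rules = mine_rules(PROTEINS)
for (combo, pathway), confidence in sorted(rules.items()):
    print(" + ".join(combo), "=>", pathway, f"(confidence {confidence:.2f})")
print(precision_recall(rules, PROTEINS))
```

In this toy set the single domain D1 does not yield a rule, because one protein carrying D1 has no pathway label, whereas the combination D1 + D2 does. A real evaluation would score the rules on held-out proteins rather than on the data they were mined from.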
