Data- and expert-driven rule induction and filtering framework for functional interpretation and description of gene sets

Background: High-throughput methods in molecular biology have provided researchers with an abundance of experimental data that must be interpreted in order to understand the experimental results. Manual functional interpretation of gene/protein groups is expensive and time-consuming; therefore, there is a need for new, efficient data mining methods and bioinformatics tools that support the expert in the functional analysis of experimental results.

Results: In this study, we propose a comprehensive framework for the induction of logical rules in the form of combinations of Gene Ontology (GO) terms for functional interpretation of gene sets. Within the framework, we present four approaches: a fully automated method of rule induction without filtering, a rule induction method with filtering, an expert-driven rule filtering method based on additive utility functions, and an expert-driven rule induction method based on so-called seed or expert terms, that is, GO terms of special interest that should be included in the description. These GO terms usually describe processes or pathways of particular interest that are related to the experiment being performed. During the rule induction and filtering processes, such seed terms are used as a base on which the description is built.

Conclusion: We compare the descriptions obtained with the different rule induction and filtering algorithms and show that a filtering step is required to reduce the number of rules in the output set so that they can be analyzed by a human expert. However, filtering may remove information from the output rule set that is potentially interesting for the expert. Therefore, in the study, we present two methods that involve interaction with the expert during the process of rule induction. Both of them are able to reduce the number of rules, but only with the method based on seed terms does each created rule include expert terms in combination with other terms. Further analysis of such combinations may provide new knowledge about biological processes and their combination with other pathways related to the genes described by the rules. A suite of Matlab scripts that provides the functionality of the framework for rule induction and filtering presented in this study is available free of charge at: http://rulego.polsl.pl/framework.
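
To make the two expert-driven ideas more concrete, the following Python sketch shows one possible reading of them. It is not the authors' Matlab implementation. The `Rule` class, the two quality measures (`coverage`, `precision`), the weights, and all numeric values are hypothetical and chosen only for illustration. The additive utility is also simplified here to a weighted sum of the measures, which is only a special case of a general additive utility function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule premise: a conjunction of GO terms plus hypothetical quality measures."""
    terms: frozenset   # GO term identifiers combined in the rule
    coverage: float    # illustrative quality measure in [0, 1]
    precision: float   # illustrative quality measure in [0, 1]


def combines_seed(rule, seed_terms):
    """True if the rule contains a seed (expert) term together with at least one other term."""
    has_seed = bool(rule.terms & seed_terms)
    has_other = bool(rule.terms - seed_terms)
    return has_seed and has_other


def additive_utility(rule, weights):
    """Simplified additive utility: weighted sum of the rule's quality measures."""
    return weights["coverage"] * rule.coverage + weights["precision"] * rule.precision


def filter_rules(rules, weights, top_k):
    """Keep the top_k rules ranked by decreasing additive utility."""
    ranked = sorted(rules, key=lambda r: additive_utility(r, weights), reverse=True)
    return ranked[:top_k]


# Illustrative input only: the quality values below are made up.
rules = [
    Rule(frozenset({"GO:0006915", "GO:0007049"}), coverage=0.40, precision=0.90),
    Rule(frozenset({"GO:0006281"}), coverage=0.70, precision=0.60),
    Rule(frozenset({"GO:0006915"}), coverage=0.50, precision=0.80),
    Rule(frozenset({"GO:0007049", "GO:0006281"}), coverage=0.30, precision=0.95),
]
seed_terms = frozenset({"GO:0006915"})           # term the expert wants in the description
weights = {"coverage": 0.3, "precision": 0.7}    # hypothetical expert preferences

top_rules = filter_rules(rules, weights, top_k=2)
seed_rules = [r for r in rules if combines_seed(r, seed_terms)]
```

In this sketch, `top_rules` shrinks the rule set by utility alone, so nothing guarantees that the seed term survives. `seed_rules` keeps only rules that pair a seed term with another term. This mirrors the contrast drawn in the Conclusion between the two expert-driven methods.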
