eXplainable Artificial Intelligence (XAI) for the identification of biologically relevant gene expression patterns in longitudinal human studies: insights from obesity research

To date, several machine learning approaches have been proposed for the dynamic modeling of temporal omics data. Although they have yielded impressive results in terms of model accuracy and predictive ability, most of these applications are based on "black-box" algorithms, and the research community has called for more interpretable models. The recent eXplainable Artificial Intelligence (XAI) revolution offers a solution to this issue, where rule-based approaches are highly suitable for explanatory purposes. Integrating the data mining process with functional-annotation and pathway analyses is a further step towards more explanatory and biologically sound models. In this paper, we present a novel rule-based XAI strategy (including pre-processing, knowledge extraction and functional validation) for finding biologically relevant sequential patterns in longitudinal human gene expression data (GED). To illustrate the performance of our pipeline, we work on in vivo temporal GED collected over the course of a long-term dietary intervention in 57 subjects with obesity (GSE77962). As validation populations, we employ three independent datasets following the same experimental design. As a result, we validate the gene patterns initially extracted and demonstrate the suitability of our strategy for mining biologically relevant gene-gene temporal relations. The whole pipeline is available as open-source software and could be easily extended to other human temporal GED applications.
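To make the idea of a gene-gene temporal rule concrete, the following is a minimal sketch and not the pipeline described above. It assumes a simple threshold discretization of log2 fold changes and a basic support/confidence definition for rules of the form "item A at an earlier time point is followed by item B at a later time point in the same subject". The gene names, the threshold, the cut-offs and the toy values are all hypothetical.

```python
from itertools import product


def discretize(log_fold_changes, threshold=1.0):
    """Turn {gene: log2 fold change} into items such as 'GENE_A_up'.

    The threshold is an illustrative assumption; genes whose change stays
    within (-threshold, threshold) produce no item.
    """
    items = set()
    for gene, lfc in log_fold_changes.items():
        if lfc >= threshold:
            items.add(f"{gene}_up")
        elif lfc <= -threshold:
            items.add(f"{gene}_down")
    return items


def mine_sequential_rules(sequences, min_support=0.5, min_confidence=0.6):
    """Mine rules a -> b where a occurs strictly before b within a subject.

    support    = subjects where a precedes b / all subjects
    confidence = subjects where a precedes b / subjects containing a
    """
    n = len(sequences)
    item_count = {}  # number of subjects containing each item
    rule_count = {}  # number of subjects in which a precedes b
    for seq in sequences:
        items_here = set().union(*seq)
        rules_here = set()
        for t, earlier in enumerate(seq):
            for later in seq[t + 1:]:
                rules_here.update(
                    (a, b) for a, b in product(earlier, later) if a != b
                )
        for item in items_here:
            item_count[item] = item_count.get(item, 0) + 1
        for rule in rules_here:
            rule_count[rule] = rule_count.get(rule, 0) + 1

    rules = []
    for (a, b), count in rule_count.items():
        support = count / n
        confidence = count / item_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical toy data: four subjects, two time points each.
subjects = [
    [{"GENE_A": 1.5, "GENE_B": 0.1}, {"GENE_A": 0.2, "GENE_B": -1.2}],
    [{"GENE_A": 1.1, "GENE_B": 0.0}, {"GENE_A": 0.3, "GENE_B": -1.4}],
    [{"GENE_A": 1.3, "GENE_B": 0.4}, {"GENE_A": 0.1, "GENE_B": 0.2}],
    [{"GENE_A": 0.0, "GENE_B": 0.3}, {"GENE_A": 0.2, "GENE_B": -2.0}],
]
sequences = [[discretize(tp) for tp in subject] for subject in subjects]
rules = mine_sequential_rules(sequences)

for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent} "
          f"(support={support:.2f}, confidence={confidence:.2f})")
```

Rules of this kind are readable on their own, which is what makes rule-based approaches attractive for explanatory purposes; in a real analysis the surviving rules would still need the functional validation step mentioned in the abstract.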
