BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes

Machine learning aids the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes from large datasets, such as those produced by genomic, transcriptomic and metagenomic analyses. A number of available algorithms can search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While treating an algorithm as a black box is sufficient in many instances, understanding how system variables contribute to a specific output can be an avenue towards new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset and show that it can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
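
The general idea of reading variable combinations off the trees of a forest can be illustrated with a minimal sketch. This is not the BowSaw procedure itself, which the abstract does not specify; it only counts how often pairs of variables are tested together on root-to-leaf paths that end in a given class. The toy forest, the taxon names and the helper functions `paths_to_label` and `count_combinations` are all hypothetical and introduced purely for illustration.

```python
from collections import Counter
from itertools import combinations

# A toy tree is either a leaf label (str) or a tuple
# (variable, left_subtree, right_subtree). These hand-written trees stand in
# for a trained random forest; they are hypothetical, not real model output.
forest = [
    ("taxon_A", ("taxon_B", "healthy", "disease"), "healthy"),
    ("taxon_B", ("taxon_A", "healthy", "disease"), ("taxon_C", "healthy", "disease")),
    ("taxon_A", "healthy", ("taxon_B", "healthy", "disease")),
]


def paths_to_label(tree, label, used=()):
    """Yield the set of variables tested on each root-to-leaf path ending in `label`."""
    if isinstance(tree, str):
        if tree == label:
            yield frozenset(used)
        return
    variable, left, right = tree
    yield from paths_to_label(left, label, used + (variable,))
    yield from paths_to_label(right, label, used + (variable,))


def count_combinations(forest, label, size=2):
    """Count how often each `size`-variable combination co-occurs on paths to `label`."""
    counts = Counter()
    for tree in forest:
        for variables in paths_to_label(tree, label):
            for combo in combinations(sorted(variables), size):
                counts[combo] += 1
    return counts


pair_counts = count_combinations(forest, "disease")
for combo, n in pair_counts.most_common():
    print(combo, n)
```

Combinations that recur across many trees are candidate "rules" in the loose sense used above. With a real forest, the same counting would be applied to the decision paths of the fitted trees rather than to hand-written tuples, and the candidates would still need to be checked against the data.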
