Domain intelligible models.

Mining biological information from rich "-omics" datasets is facilitated by organizing features into groups that relate to a biological phenomenon or clinical outcome. For example, microorganisms can be grouped according to a phylogenetic tree that captures their similarities in genetic or physical characteristics. Here, we describe algorithms that build intelligible models by incorporating auxiliary information, namely groups of predictors and the relationships between them, into the metagenome learning task. In particular, our cost function guides the feature selection process with this auxiliary information by requiring related groups of predictors to make similar contributions to the final response. We apply the developed algorithms to a recently published dataset on the effects of fecal microbiota transplantation (FMT) in order to identify factors associated with improved peripheral insulin sensitivity, yielding accurate predictions of the response to FMT.
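To make the cost-function idea concrete, a minimal sketch of one plausible objective is given below; the notation (coefficients beta_g for group g, a set E of related group pairs derived from the phylogenetic tree, group sizes p_g, and tuning parameters lambda_1, lambda_2, mu) is our own illustration under these assumptions and is not taken verbatim from the paper. It combines a sparse-group lasso penalty with a co-regularization term that couples related groups:

\[
\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
+ \lambda_1 \lVert \beta \rVert_1
+ \lambda_2 \sum_{g} \sqrt{p_g}\,\lVert \beta_g \rVert_2
+ \mu \sum_{(g,h)\in E} \lVert X_g\beta_g - X_h\beta_h \rVert_2^2
\]

Here X_g beta_g is the contribution of group g to the fitted response, so the last term penalizes related groups (for example, neighbouring clades in the tree) whose contributions to the response differ, while the l1 and group l2 penalties perform feature selection at the level of individual taxa and whole groups, respectively.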
