Interpretable Log Contrasts for the Classification of Health Biomarkers: a New Approach to Balance Selection

ABSTRACT Since the turn of the century, technological advances have made it possible to obtain the molecular profile of any tissue in a cost-effective manner. Among these advances are sophisticated high-throughput assays that measure the relative abundances of microorganisms, RNA molecules, and metabolites. While these data are most often collected to gain new insights into biological systems, they can also be used as biomarkers to create clinically useful diagnostic classifiers. How best to classify high-dimensional -omics data remains an area of active research. However, few methods explicitly model the relative nature of these data; most instead rely on cumbersome normalizations. This report (i) emphasizes the relative nature of health biomarkers, (ii) discusses the literature surrounding the classification of relative data, and (iii) benchmarks how different transformations perform for regularized logistic regression across multiple biomarker types. We show how an interpretable set of log contrasts, called balances, can prepare data for classification. We propose a simple procedure, called discriminative balance analysis, to select groups of 2 and 3 bacteria that can together discriminate between experimental conditions. Discriminative balance analysis is a fast, accurate, and interpretable alternative to data normalization.

IMPORTANCE High-throughput sequencing provides an easy and cost-effective way to measure the relative abundance of bacteria in any environmental or biological sample. When these samples come from humans, the microbiome signatures can act as biomarkers for disease prediction. However, because bacterial abundance is measured as a composition, the data have unique properties that make conventional analyses inappropriate. To overcome this, analysts often use cumbersome normalizations. This article proposes an alternative method that identifies pairs and trios of bacteria whose stoichiometric presence can differentiate between diseased and nondiseased samples. By using interpretable log contrasts called balances, we developed an entirely normalization-free classification procedure that reduces the feature space and improves interpretability without sacrificing classifier performance.
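
To make the idea of a balance concrete, the following is a minimal sketch of how a log contrast between two groups of parts is commonly computed in the compositional data literature: a scaled difference between the mean log abundance of a numerator group and that of a denominator group. It is an illustration only, not the authors' discriminative balance analysis procedure; the function name `balance`, the index arguments, and the toy counts are all assumptions, and the sketch assumes strictly positive values (zeros would have to be handled beforehand).

```python
import numpy as np


def balance(x, numerator, denominator):
    """Hypothetical helper: balance between two groups of parts.

    x           : array of shape (n_samples, n_parts), strictly positive
    numerator   : list of column indices in the numerator group
    denominator : list of column indices in the denominator group
    """
    log_x = np.log(np.asarray(x, dtype=float))
    r, s = len(numerator), len(denominator)
    # Standard normalizing constant for a balance between r and s parts.
    scale = np.sqrt(r * s / (r + s))
    # Mean of logs equals the log of the geometric mean of each group.
    log_gm_num = log_x[..., numerator].mean(axis=-1)
    log_gm_den = log_x[..., denominator].mean(axis=-1)
    return scale * (log_gm_num - log_gm_den)


# Toy data (made up for illustration): 2 samples, 3 taxa.
counts = np.array([[10.0, 40.0, 50.0],
                   [20.0, 20.0, 60.0]])

b_pair = balance(counts, [0], [1])        # balance of a pair of taxa
b_trio = balance(counts, [0], [1, 2])     # balance of a trio of taxa

# Rescaling every sample (e.g. a different sequencing depth) leaves the
# balance unchanged, which is why no normalization step is required.
b_rescaled = balance(counts * 7.0, [0], [1])
```

Because each balance is a single number per sample, a small set of such pair and trio balances could be passed as features to any standard classifier, such as the regularized logistic regression mentioned above.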
