Finding the Centre: Compositional Asymmetry in High-Throughput Sequencing Datasets

High-throughput sequencing datasets comprise millions of reads of genomic data and can be modelled as count compositions. These data are used to measure transcription profiles, microbial diversity, or relative cellular abundance in culture. The data are sparse and high dimensional. Moreover, they are often unbalanced: there is systematic variation between groups due to the presence or absence of features, and this variation is important to the biological interpretation of the data. The imbalance causes samples in the comparison groups to exhibit different centres, which contributes to both false positive and false negative identifications. Here, we extend the centred log-ratio transformation method used for the comparison of differential relative abundance between two groups in a Bayesian compositional context. We demonstrate the pathology in modelled and real unbalanced experimental designs to show how it causes both false negative and false positive inference. We examined four approaches to identify denominator features and tested them with different proportions of modelled asymmetry; two were relatively robust and are recommended. When using the ALDEx2 tool available on Bioconductor, we recommend the ‘LVHA’ transformation for asymmetric transcriptome datasets and the ‘IQLR’ method for all other datasets.
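
The role of the denominator can be illustrated with a minimal Python sketch. It is not the ALDEx2 implementation (ALDEx2 is an R package and works on Monte Carlo Dirichlet instances rather than raw counts). The function names `clr` and `iqlr_like`, the pseudocount of 0.5, and the toy data are assumptions made for illustration only. The second function follows the general idea of an interquartile log-ratio: features whose centred log-ratio variance lies between the first and third quartiles are used as the denominator, so that features present in only one group do not shift the centre.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centred log-ratio: log counts minus the per-sample mean log count.

    The pseudocount (an assumption here) avoids log(0) in sparse data.
    """
    logs = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logs - logs.mean(axis=1, keepdims=True)

def iqlr_like(counts, pseudocount=0.5):
    """Log-ratio using only mid-variance features as the denominator.

    Returns the transformed values and the boolean mask of denominator
    features. Needs at least three features so the mask is non-empty.
    """
    logs = np.log(np.asarray(counts, dtype=float) + pseudocount)
    centred = logs - logs.mean(axis=1, keepdims=True)
    var = centred.var(axis=0)                 # per-feature variance across samples
    q1, q3 = np.quantile(var, [0.25, 0.75])
    keep = (var >= q1) & (var <= q3)          # features within the interquartile range
    return logs - logs[:, keep].mean(axis=1, keepdims=True), keep

# Toy asymmetric dataset (hypothetical): 6 samples x 20 features,
# with the first 5 features absent from the second group of 3 samples.
rng = np.random.default_rng(0)
counts = rng.poisson(100, size=(6, 20))
counts[3:, :5] = 0

clr_values = clr(counts)
iqlr_values, denominator_mask = iqlr_like(counts)

# Mean shift between groups for features that did not change by construction.
unchanged = slice(5, None)
clr_shift = clr_values[3:, unchanged].mean() - clr_values[:3, unchanged].mean()
iqlr_shift = iqlr_values[3:, unchanged].mean() - iqlr_values[:3, unchanged].mean()
print("CLR shift of unchanged features:", clr_shift)
print("IQLR-like shift of unchanged features:", iqlr_shift)
```

In this toy setting the absent features pull down the geometric mean of the second group, so the unchanged features appear shifted under the plain centred log-ratio; restricting the denominator to mid-variance features is intended to reduce that shift.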
