Compositional uncertainty should not be ignored in high-throughput sequencing data analysis

High-throughput sequencing generates sparse compositional data, yet these datasets are rarely analyzed using a compositional approach. In addition, the variation inherent in these datasets is rarely acknowledged, and ignoring it can result in many false positive inferences. Using an in vitro selection dataset with an external standard of truth, we demonstrate that examining point estimates of the data can produce false positive results, even with appropriate zero-replacement approaches. We demonstrate the variation inherent in real high-throughput sequencing datasets and show that this variation can be approximated, and hence accounted for, by Monte Carlo sampling from the Dirichlet distribution. Used alone, this approximation is problematic, but it becomes useful when coupled with a log-ratio approach commonly used in compositional data analysis. Thus, the approach illustrated here, which merges Bayesian estimation with principles of compositional data analysis, should be generally useful for high-dimensional count compositional data of the type generated by high-throughput sequencing.
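The combination of Dirichlet Monte Carlo sampling with a log-ratio transformation can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dirichlet_clr_instances`, the uniform prior of 0.5 added to each count, the 128 draws, the use of the centred log-ratio (clr) as the log-ratio transform, and the example counts are all assumptions made for illustration.

```python
import numpy as np


def dirichlet_clr_instances(counts, n_draws=128, prior=0.5, seed=0):
    """Return clr-transformed Monte Carlo instances for one sample.

    Hypothetical helper: `prior`, `n_draws`, and `seed` are illustrative
    assumptions, not values taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    # Each draw is one plausible vector of underlying proportions given the
    # observed counts; adding the prior keeps zero counts strictly positive.
    probs = rng.dirichlet(counts + prior, size=n_draws)
    log_p = np.log(probs)
    # Centred log-ratio: subtract the mean log value (the log of the
    # geometric mean) within each draw.
    return log_p - log_p.mean(axis=1, keepdims=True)


# Made-up counts for five features in a single sample.
counts = [0, 3, 120, 45, 1]
clr = dirichlet_clr_instances(counts)
print("mean clr per feature:", clr.mean(axis=0))
print("sd of clr per feature:", clr.std(axis=0))
```

In this sketch the spread of the clr values across draws is much wider for the zero-count and low-count features than for the high-count ones, which is the kind of uncertainty that a single point estimate hides.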
