Understanding sequencing data as compositions: an outlook and review

Motivation: Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions, in which the abundance of each component (e.g., gene or transcript) is only coherently interpretable relative to the other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e., library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that renders invalid many conventional analyses, including distance measures, correlation coefficients, and multivariate statistical models.

Results: The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study.
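The point about the arbitrary library size can be illustrated with a minimal sketch. It is not taken from the review itself: the function names `closure` and `clr` and the three-component sample are assumptions chosen for illustration, and the sketch assumes strictly positive counts (zeros would need separate handling before taking logarithms). It shows that raw counts change with sequencing depth, while proportions and centered log-ratio values, which describe each component relative to the others, do not.

```python
import math


def closure(counts):
    """Rescale positive counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(counts):
    """Centered log-ratio: log of each part minus the mean log (log geometric mean)."""
    logs = [math.log(c) for c in counts]  # assumes no zero counts
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical sample: the same mixture sequenced at two different depths.
shallow = [10, 20, 70]
deep = [c * 50 for c in shallow]  # a 50x larger library

print("raw counts:  ", shallow, deep)                      # differ by library size
print("proportions: ", closure(shallow), closure(deep))    # agree
print("clr values:  ", clr(shallow), clr(deep))            # agree
```

Because the library size cancels inside every ratio, any analysis built on log-ratios gives the same answer at either depth, whereas one built on the raw counts does not.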
