Differential proportionality: a normalization-free approach to differential gene expression

Gene expression data, such as those generated by next-generation sequencing technologies (RNA-seq), are inherently relative: the total number of sequenced reads has no biological meaning. This issue is most often addressed with normalization techniques, all of which face the same problem: once information about the total mRNA content of the cells of origin is lost, it cannot be recovered by purely technical means. Additional knowledge, in the form of an unchanged reference, is necessary; however, this reference can usually only be estimated. Here we propose a novel method that makes sample normalization unnecessary yet still yields important insights. Instead of trying to recover absolute abundances, our method is based entirely on ratios, so normalization factors cancel by construction. Although the differential expression of individual genes cannot be recovered this way, the ratios themselves can be differentially expressed (even when their constituents are not). Most current analyses are blind to these cases, whereas our approach reveals them directly. Specifically, we show how the differential expression of gene ratios can be formalized by decomposing log-ratio variance (LRV) and deriving intuitive statistics from it. Although small LRVs have been used before to detect proportional genes in gene expression data, we focus here on the change in proportionality factors between groups of samples (e.g. tissue-specific proportionality). For this, we propose a statistic that is equivalent to the squared t-statistic of one-way ANOVA, but for gene ratios. In doing so, we show how precision weights can be incorporated to account for the peculiarities of count data and, moreover, how a moderated statistic can be derived in the same way as the one that follows from a hierarchical model for individual genes. We also discuss approaches to dealing with zero counts, deriving an expression of our statistic that can incorporate them.
By providing a detailed analysis of the connections between the differential expression of genes and the differential proportionality of gene pairs, we facilitate a clear interpretation of the new concepts. The proposed framework is applied to a data set from GTEx consisting of 98 samples from the cerebellum and cortex, with selected examples shown. A computationally efficient implementation of the approach in R has been released as an addendum to the propr package.
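
The two central ideas above (per-sample normalization factors cancel in ratios, and the log-ratio variance of a gene pair can be split into within-group and between-group parts) can be illustrated with a minimal Python sketch. This is not the propr implementation: the function name `lrv_decomposition`, the toy counts, and the size factors are assumptions made for illustration, and the sketch omits precision weights, moderation, and zero handling. It assumes strictly positive counts and computes the fraction of the total log-ratio sum of squares that lies within groups (called `theta` here), together with the one-way ANOVA F-statistic of the log-ratio, which for two groups equals the squared pooled t-statistic.

```python
import numpy as np

def lrv_decomposition(x, y, groups):
    """Decompose the variance of log(x / y) into within- and between-group parts.

    Returns (theta, f_stat), where theta is the within-group share of the
    total sum of squares and f_stat is the one-way ANOVA F-statistic.
    Hypothetical helper for illustration; assumes strictly positive counts.
    """
    z = np.log(np.asarray(x, dtype=float) / np.asarray(y, dtype=float))
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, k = z.size, labels.size

    # Total sum of squares equals (n - 1) times the overall log-ratio variance.
    total_ss = np.sum((z - z.mean()) ** 2)
    # Within-group sum of squares: sum over groups of (n_g - 1) times the group LRV.
    within_ss = sum(
        np.sum((z[groups == g] - z[groups == g].mean()) ** 2) for g in labels
    )

    theta = within_ss / total_ss
    f_stat = ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))
    return theta, f_stat

# Toy counts for two genes in six samples from two groups (made-up numbers).
x = np.array([10.0, 12.0, 11.0, 40.0, 44.0, 42.0])
y = np.array([20.0, 22.0, 21.0, 20.0, 23.0, 21.0])
groups = np.array([0, 0, 0, 1, 1, 1])

theta, f_stat = lrv_decomposition(x, y, groups)

# Arbitrary per-sample scaling (e.g. sequencing depth) cancels in the ratio.
size_factors = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 0.7])
theta_scaled, f_scaled = lrv_decomposition(x * size_factors, y * size_factors, groups)

print(theta, f_stat)
print(np.isclose(theta, theta_scaled), np.isclose(f_stat, f_scaled))
```

A value of `theta` close to 0 means that almost all of the log-ratio variance is explained by the difference in group means, that is, the pair has a different proportionality factor in each group; a value close to 1 means the grouping explains almost nothing.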
