How should we measure proportionality on relative gene expression data?

Correlation is ubiquitously used in gene expression analysis, although its validity as an objective criterion is often questionable. If no normalization reflecting the original mRNA counts in the cells is available, correlation between genes becomes spurious. Yet the need for normalization can be bypassed using a relative analysis approach called log-ratio analysis. This approach can be used to identify proportional gene pairs, i.e., the subset of pairs whose correlation can be inferred correctly from unnormalized data because their log-ratio variance vanishes. To interpret the size of non-zero log-ratio variances, Lovell et al. recently proposed scaling them by the variance of one member of the gene pair. Here we derive analytically how spurious proportionality is introduced when such a scaling is used. We base our analysis on a symmetric proportionality coefficient (briefly mentioned in Lovell et al.) that has a number of advantages over their statistic. We show in detail how the choice of reference needed for the scaling determines which gene pairs are identified as proportional. We demonstrate that using an unchanged gene as a reference substantially improves sensitivity. We also explore the link between proportionality and partial correlation and derive expressions for a partial proportionality coefficient. A brief data analysis puts the discussed concepts into practice.
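
As a minimal illustration of the quantities named above, the following Python sketch computes a log-ratio variance and a symmetric proportionality coefficient on simulated relative data. Several things here are assumptions rather than content of the abstract: the coefficient is taken to have the form 1 - var(log x - log y) / (var(log x) + var(log y)) after log-ratio transformation with respect to a reference, and the function names, the simulated genes, and the per-sample scaling factors are all hypothetical.

```python
import numpy as np


def log_ratio_variance(x, y):
    """Sample variance of log(x / y); unaffected by per-sample scaling."""
    return float(np.var(np.log(x) - np.log(y), ddof=1))


def rho(x, y, ref):
    """Symmetric proportionality coefficient (assumed form, see lead-in).

    Both genes are first expressed as log-ratios with respect to `ref`.
    """
    lx = np.log(x) - np.log(ref)
    ly = np.log(y) - np.log(ref)
    scale = np.var(lx, ddof=1) + np.var(ly, ddof=1)
    return float(1.0 - np.var(lx - ly, ddof=1) / scale)


rng = np.random.default_rng(0)
n_samples = 50

# Hypothetical absolute abundances (not observable in practice).
gene_a = rng.lognormal(mean=5.0, sigma=0.5, size=n_samples)
gene_b = 2.0 * gene_a                        # exactly proportional to gene_a
gene_c = rng.lognormal(mean=5.0, sigma=0.5, size=n_samples)  # unrelated gene
unchanged = np.full(n_samples, 100.0)        # reference with constant abundance

# Unknown per-sample factors turn absolute into relative (unnormalized) data.
factors = rng.uniform(0.1, 10.0, size=n_samples)
rel_a, rel_b, rel_c, rel_ref = (g * factors for g in (gene_a, gene_b, gene_c, unchanged))

lrv_ab = log_ratio_variance(rel_a, rel_b)    # vanishes for a proportional pair
rho_ab = rho(rel_a, rel_b, rel_ref)          # close to 1 for a proportional pair
rho_ac = rho(rel_a, rel_c, rel_ref)          # lower for an unrelated pair

print(f"log-ratio variance (a, b): {lrv_ab:.3g}")
print(f"rho (a, b): {rho_ab:.3f}")
print(f"rho (a, c): {rho_ac:.3f}")
```

In this sketch the unknown scaling factors cancel in every log-ratio, so dividing by an unchanged reference recovers the absolute log-abundances up to an additive constant. A reference that itself varies across samples would change the denominator of the coefficient, and with it the set of pairs that look proportional.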