Compositional Covariance Shrinkage and Regularised Partial Correlations

We propose an estimation procedure for covariation in wide compositional data sets. For compositions, the widely used logratio variables are interdependent because they share a common reference. Logratio-uncorrelated compositions are linearly independent before the unit-sum constraint is imposed. We show how they can be used to construct bespoke shrinkage targets for logratio covariance matrices, and we test a simple procedure for partial correlation estimates on both a simulated and a single-cell gene expression data set. For the underlying counts, we evaluate different zero imputations. We also derive analytically the partial correlation induced by the closure. Data and code are available from GitHub.
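As a rough illustration of the kind of pipeline described above, the following minimal sketch computes centred logratios from counts, shrinks their covariance matrix, and converts the result to partial correlations. It is not the procedure of the paper: the uniform pseudocount (standing in for a zero imputation), the plain diagonal shrinkage target (standing in for the bespoke compositional target), the fixed shrinkage intensity `lam`, and all function names are assumptions made for this example only.

```python
import numpy as np


def clr(counts, pseudocount=0.5):
    """Centred logratio transform of a samples-by-parts count matrix.

    The uniform pseudocount is an assumed, very crude zero replacement.
    """
    logx = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)


def shrink_covariance(z, lam=0.2):
    """Convex combination of the sample covariance and its diagonal.

    Both the diagonal target and the fixed intensity `lam` are assumptions;
    a data-driven intensity and a compositional target would replace them.
    """
    s = np.cov(z, rowvar=False)
    target = np.diag(np.diag(s))
    return (1.0 - lam) * s + lam * target


def partial_correlations(cov):
    """Partial correlations from the inverse of a positive definite covariance."""
    prec = np.linalg.inv(cov)
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Toy data: 30 samples, 8 parts (hypothetical counts, not from the paper).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=20.0, size=(30, 8))
z = clr(counts)
pcor = partial_correlations(shrink_covariance(z))
```

The sample covariance of centred logratios is singular because every row of `z` sums to zero, so some regularisation is needed before inversion. Here the diagonal target supplies it, provided every part has non-zero logratio variance.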
