Comparison of transformations for single-cell RNA-seq data

The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-seq data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state, and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties. However, in benchmarks using simulated and real-world data, a rather simple approach, namely the logarithm with a pseudo-count followed by principal component analysis, performs as well as or better than the more sophisticated alternatives.

Software: The R package transformGamPoi, which implements the delta method-based and residuals-based variance-stabilizing transformations, is available via Bioconductor. We provide an interactive website to explore the benchmark results at shiny-portal.embl.de/shinyapps/app/08_single-cell_transformation_benchmark.

Contact: constantin.ahlmann@embl.de
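As a rough illustration of the simple approach named in the abstract (logarithm with a pseudo-count followed by principal component analysis), the following Python sketch may help. It is not the transformGamPoi implementation. The function names, the definition of size factors as per-cell total counts scaled to mean one, the pseudo-count of 1, and the simulated Poisson counts are all assumptions made for this example.

```python
import numpy as np


def size_factors(counts):
    """Per-cell size factors (assumed here: column totals scaled to mean 1)."""
    totals = counts.sum(axis=0).astype(float)
    return totals / totals.mean()


def shifted_log(counts, pseudo_count=1.0):
    """Divide each cell (column) by its size factor, then apply log(x + pseudo_count)."""
    s = size_factors(counts)
    return np.log(counts / s + pseudo_count)


def pca(transformed, n_components=2):
    """PCA of the cells via SVD of the gene-centered cells x genes matrix."""
    x = transformed.T                      # rows are cells, columns are genes
    x = x - x.mean(axis=0)                 # center each gene
    u, d, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * d[:n_components]


# Hypothetical toy data: genes x cells counts with variable sampling depth.
rng = np.random.default_rng(0)
n_genes, n_cells = 200, 50
gene_means = rng.gamma(shape=2.0, scale=2.0, size=(n_genes, 1))
depth = rng.uniform(0.5, 2.0, size=(1, n_cells))
counts = rng.poisson(gene_means * depth)

embedding = pca(shifted_log(counts), n_components=5)
print(embedding.shape)  # one row per cell, one column per component
```

The count matrix keeps the genes × cells orientation used in the abstract, so size factors are computed per column and the matrix is transposed only for the PCA step.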
