scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

In the burgeoning field of single-cell transcriptomics, a pressing challenge is to benchmark various experimental protocols and numerous computational methods in an unbiased manner. Although dozens of simulators have been developed for single-cell RNA-seq (scRNA-seq) data, none of them achieves all three of the following goals simultaneously: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, an interpretable simulator that achieves all three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copulas. We verify that scDesign2 generates more realistic synthetic data than existing simulators do for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq). Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on choosing the optimal sequencing depth and cell number in scRNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs.
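To make the copula idea concrete, the following is a minimal sketch of count simulation with a copula, not the scDesign2 implementation. It assumes a Gaussian copula and negative binomial marginals fitted by the method of moments; neither choice is stated in the abstract, and the function names `fit_nb_moments` and `simulate_counts` are hypothetical. The sketch keeps the same genes as the input, estimates gene-gene correlation on a latent normal scale, and can generate any number of cells.

```python
import numpy as np
from scipy import stats


def fit_nb_moments(counts):
    """Method-of-moments negative binomial fit per gene (an assumed,
    simplified marginal model). Returns scipy's (n, p) parameters."""
    mu = np.maximum(counts.mean(axis=0), 1e-8)
    var = counts.var(axis=0, ddof=1)
    # When variance <= mean, use a very large size (close to Poisson).
    excess = np.maximum(var - mu, 1e-12)
    size = np.where(var > mu, mu ** 2 / excess, 1e6)
    p = size / (size + mu)
    return size, p


def simulate_counts(counts, n_cells, seed=0):
    """Simulate an (n_cells x genes) count matrix from a cells x genes
    matrix, using a Gaussian copula over negative binomial marginals."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    size, p = fit_nb_moments(counts)

    # Distributional transform: map discrete counts to uniforms by
    # jittering within each CDF jump, then to latent normal scores.
    lower = stats.nbinom.cdf(counts - 1, size, p)
    upper = stats.nbinom.cdf(counts, size, p)
    u = lower + rng.uniform(size=counts.shape) * (upper - lower)
    z = stats.norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))

    # Gene-gene correlation on the latent normal scale.
    corr = np.corrcoef(z, rowvar=False)

    # Sample new latent cells and map back through the NB quantiles.
    z_new = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_cells)
    u_new = np.clip(stats.norm.cdf(z_new), 1e-6, 1 - 1e-6)
    return stats.nbinom.ppf(u_new, size, p).astype(int)
```

Calling `simulate_counts(real_counts, n_cells=1000)` on a cells-by-genes matrix for a single cell type returns a synthetic matrix with the same gene columns. Varying sequencing depth is not handled here; one assumed extension would be to rescale the fitted marginal means before sampling.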
