Benchmarking of cell type deconvolution pipelines for transcriptomics data

Many computational methods have been developed to infer cell type proportions from bulk transcriptomics data. However, an evaluation of the impact of data transformation, pre-processing, marker selection, cell type composition and choice of methodology on the deconvolution results is still lacking. Using five single-cell RNA-sequencing (scRNA-seq) datasets, we generate pseudo-bulk mixtures to evaluate the combined impact of these factors. Both bulk deconvolution methodologies and those that use scRNA-seq data as reference perform best when applied to data in linear scale, and the choice of normalization has a dramatic impact on some, but not all, methods. Overall, methods that use scRNA-seq data perform comparably to the best-performing bulk methods, whereas semi-supervised approaches show higher error values. Moreover, failing to include in the reference cell types that are present in a mixture leads to substantially worse results, regardless of the previous choices. Altogether, we evaluate the combined impact of factors affecting the deconvolution task across different datasets and propose general guidelines to maximize its performance.
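The ideas summarized above (pseudo-bulk mixtures, linear versus logarithmic scale, and an incomplete reference) can be illustrated with a toy example. The sketch below is not the benchmarking pipeline itself: the expression profiles, cell counts and function names are hypothetical, the cells are noise-free copies of their cell type profile, and the solver is a simple multiplicative-update non-negative least squares chosen only to keep the example dependency-free.

```python
import math

# Hypothetical toy profiles: expression of 4 genes in 3 cell types.
# Genes 1-3 act as markers; gene 4 is expressed equally everywhere.
CELL_TYPES = ["A", "B", "C"]
PROFILES = {
    "A": [10.0, 1.0, 1.0, 5.0],
    "B": [1.0, 10.0, 1.0, 5.0],
    "C": [1.0, 1.0, 10.0, 5.0],
}


def pseudo_bulk(cells):
    """Average single-cell expression vectors into one pseudo-bulk mixture."""
    n_genes = len(cells[0])
    return [sum(cell[g] for cell in cells) / len(cells) for g in range(n_genes)]


def deconvolve(reference, bulk, iters=2000):
    """Estimate proportions by non-negative least squares.

    Uses multiplicative updates (valid for non-negative inputs) and
    rescales the solution so that the proportions sum to 1.
    `reference` is a list of per-cell-type profiles (lists over genes).
    """
    k = len(reference)
    genes = range(len(bulk))
    # Precompute S^T b and S^T S for the signature matrix S.
    stb = [sum(reference[i][g] * bulk[g] for g in genes) for i in range(k)]
    sts = [[sum(reference[i][g] * reference[j][g] for g in genes)
            for j in range(k)] for i in range(k)]
    x = [1.0 / k] * k
    for _ in range(iters):
        # All entries are updated simultaneously from the previous x.
        x = [x[i] * stb[i] / sum(sts[i][j] * x[j] for j in range(k))
             for i in range(k)]
    total = sum(x)
    return [v / total for v in x]


def max_abs_error(estimate, truth):
    """Largest absolute difference between estimated and true proportions."""
    return max(abs(e - t) for e, t in zip(estimate, truth))


# Build a pseudo-bulk mixture from 100 noise-free toy cells.
counts = {"A": 50, "B": 30, "C": 20}
cells = [PROFILES[t] for t in CELL_TYPES for _ in range(counts[t])]
truth = [counts[t] / len(cells) for t in CELL_TYPES]
bulk = pseudo_bulk(cells)
reference = [PROFILES[t] for t in CELL_TYPES]

# 1) Linear scale: the mixture is an exact linear combination of the profiles.
est_linear = deconvolve(reference, bulk)

# 2) Log scale: log1p of a mixture is not a mixture of log1p profiles.
log_reference = [[math.log1p(v) for v in profile] for profile in reference]
log_bulk = [math.log1p(v) for v in bulk]
est_log = deconvolve(log_reference, log_bulk)

# 3) Incomplete reference: cell type C is present in the mixture but omitted.
est_missing = deconvolve(reference[:2], bulk)

print("true proportions      :", truth)
print("linear-scale estimate :", [round(v, 4) for v in est_linear])
print("log-scale estimate    :", [round(v, 4) for v in est_log])
print("reference without C   :", [round(v, 4) for v in est_missing])
```

In this noise-free setting the linear-scale estimate matches the true proportions, the log-scale estimate is biased because the logarithm breaks the linearity of the mixture, and omitting cell type C forces its share onto A and B. Real data add noise, normalization and marker selection effects that this sketch deliberately leaves out.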
