Systematic evaluation of transcriptomics-based deconvolution methods and references using thousands of clinical samples

Estimating the cell type composition of blood and tissue samples is a biological challenge relevant to both laboratory studies and clinical care. In recent years, a number of computational tools have been developed to estimate cell type abundance from gene expression data. While these tools use a variety of approaches, they all leverage expression profiles from purified cell types to estimate the cell type composition of samples. In this study, we compare ten deconvolution tools and evaluate the performance of each with eleven separate reference profiles. Specifically, we have run deconvolution tools on over 4,000 samples with known cell type proportions, spanning both immune and stromal cell types. Twelve of these are in vitro synthetic mixtures, and 300 are in silico synthetic mixtures prepared from single-cell data. The remaining 3,728 are clinical samples collected from the Framingham Cohort, for which cell populations have been quantified using electrical impedance cell counting. When tools are applied to the Framingham dataset, EPIC produces the highest correlation while GEDIT produces the lowest error. The best-performing tool varies across the other datasets, but CIBERSORT and GEDIT most consistently produce accurate results. In terms of reference choice, we find that the Human Primary Cell Atlas (HPCA) and the references published by the EPIC authors produce accurate results for the largest number of tools and datasets. When applying deconvolution to blood samples, the leukocyte reference matrix LM22 is also a suitable choice, usually (but not always) outperforming HPCA and EPIC. Running time varies substantially across tools. For as many as 5,052 samples, SaVanT and dtangle reliably finish in under one minute, while slower tools may require up to two hours. However, when using custom references, CIBERSORT can run very slowly, taking over 24 hours to complete for large datasets.
We conclude that combining the best tools with optimal reference datasets can provide significant gains in accuracy when carrying out deconvolution tasks.
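
The abstract reports that one tool can achieve the highest correlation while another achieves the lowest error. The following minimal sketch illustrates how these two metrics can disagree. It assumes Pearson correlation and mean absolute error as the metrics (the abstract does not specify the exact definitions used), and all proportions and tool names below are hypothetical values chosen for illustration, not results from the study.

```python
import numpy as np

def pearson_r(known, predicted):
    """Pearson correlation between known and predicted proportions."""
    known = np.asarray(known, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    kc = known - known.mean()
    pc = predicted - predicted.mean()
    return float((kc @ pc) / np.sqrt((kc @ kc) * (pc @ pc)))

def mean_abs_error(known, predicted):
    """Mean absolute difference between known and predicted proportions."""
    known = np.asarray(known, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(known - predicted)))

# Hypothetical known proportions for three cell types in one sample.
known = np.array([0.60, 0.30, 0.10])

# Hypothetical tool A: ranks cell types perfectly but underestimates
# every proportion by half (high correlation, large error).
tool_a = np.array([0.30, 0.15, 0.05])

# Hypothetical tool B: close in absolute terms but not exactly
# proportional (slightly lower correlation, small error).
tool_b = np.array([0.55, 0.30, 0.15])

for name, pred in [("tool A", tool_a), ("tool B", tool_b)]:
    print(name, "r =", round(pearson_r(known, pred), 4),
          "MAE =", round(mean_abs_error(known, pred), 4))
```

In this toy case, tool A has the higher correlation and tool B the lower error, which is why a benchmark may report both metrics rather than a single ranking.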
