Deconvolution of heterogeneous tumor samples using partial reference signals

Deconvolution of heterogeneous bulk tumor samples into distinct cellular populations is an important yet challenging problem, particularly when only partial references are available. A common approach is to deconvolve the mixed signals using the available references and treat the residual signal as a new cell component. However, our simulations indicate that this approach tends to overestimate the proportions of known cell types and fails to detect novel cell types. Here, we propose PREDE, a partial reference-based deconvolution method that uses an iterative non-negative matrix factorization algorithm. On datasets simulated under a variety of parameter settings, our method effectively estimates both the cell proportions and the expression profiles of unknown cell types. Applying our method to TCGA tumor samples, we found that the estimated proportions of pure cancer cells are better indicators of tumor subtype. We also detected, for each cancer type, several cell types whose proportions predicted patient survival. Our method advances the deconvolution of heterogeneous tumor samples and could be applied to a wide variety of high-throughput bulk data. PREDE is implemented in R and is freely available from GitHub (https://xiaoqizheng.github.io/PREDE).
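To make the idea of partial reference-based factorization concrete, the following is a minimal sketch in Python with NumPy. It is not the PREDE implementation (which is in R) and is not taken from the paper. It assumes a Frobenius-norm objective with standard multiplicative non-negative matrix factorization updates, in which the known reference profiles `W1` are held fixed and only the unknown profiles `W2` and the mixing matrix `H` are updated. The function name `partial_reference_nmf` and all of its parameters are hypothetical.

```python
import numpy as np

def partial_reference_nmf(Y, W1, k_unknown, n_iter=500, eps=1e-9, seed=0):
    """Approximate Y (features x samples) by [W1, W2] @ H with W1 held fixed.

    Illustrative sketch only; PREDE's actual update rules may differ.
    Y  : non-negative bulk data, shape (n_features, n_samples)
    W1 : non-negative known reference profiles, shape (n_features, k_known)
    Returns the estimated unknown profiles W2, the raw mixing matrix H,
    and H rescaled so that each sample's proportions sum to one.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = Y.shape
    k_known = W1.shape[1]

    # Random positive initialization of the unknown profiles and the mixing matrix.
    W2 = rng.uniform(0.1, 1.0, size=(n_features, k_unknown))
    H = rng.uniform(0.1, 1.0, size=(k_known + k_unknown, n_samples))

    for _ in range(n_iter):
        W = np.hstack([W1, W2])
        # Multiplicative update of all mixing coefficients.
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
        # Multiplicative update of the unknown profiles only; W1 is never changed.
        H2 = H[k_known:]
        W2 *= (Y @ H2.T) / (W @ H @ H2.T + eps)

    # Rescale each sample's coefficients to sum to one so they read as proportions.
    proportions = H / (H.sum(axis=0, keepdims=True) + eps)
    return W2, H, proportions
```

In this sketch the first `W1.shape[1]` rows of `proportions` correspond to the known cell types and the remaining rows to the unknown ones. Rescaling the columns of `H` to sum to one is a simplification chosen for illustration; a constrained solver would be another option.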
