Dhaka: Variational Autoencoder for Unmasking Tumor Heterogeneity from Single Cell Genomic Data

MOTIVATION Intra-tumor heterogeneity is one of the key confounding factors in deciphering tumor evolution. Malignant cells vary in their gene expression, copy numbers, and mutations even when they originate from a single progenitor cell. Single-cell sequencing of tumor cells has recently emerged as a viable option for unmasking this underlying heterogeneity. However, extracting features from single-cell genomic data in order to infer the cells' evolutionary trajectory remains computationally challenging because the data are extremely noisy and sparse.

RESULTS Here we describe 'Dhaka', a variational autoencoder method that transforms single-cell genomic data into a reduced-dimension feature space that is more effective at differentiating between (hidden) tumor subpopulations. The method is general and can be applied to several types of genomic data, including copy number variation from scDNA-Seq and gene expression from scRNA-Seq experiments. We tested the method on synthetic data and on six single-cell cancer datasets in which the number of cells ranges from 250 to 6000 per sample. Analysis of the resulting feature space revealed subpopulations of cells and their marker genes. The features can also be used to infer the lineage and/or differentiation trajectory between cells, greatly improving upon prior methods suggested for feature extraction and dimensionality reduction of such data.

AVAILABILITY AND IMPLEMENTATION All datasets used in the paper are publicly available. The software package and supporting information are available on GitHub: https://github.com/MicrosoftGenomics/Dhaka.
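To make the "variational autoencoder" step concrete, the sketch below shows the two generic ingredients that separate a VAE from a plain autoencoder: sampling a latent code with the reparameterization trick, and penalizing the encoder's output with a KL divergence to a standard normal prior. It is only an illustration under stated assumptions and does not reproduce Dhaka's architecture or loss: the function names, the squared-error reconstruction term, the three-dimensional latent code, and the example values are all hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a single cell."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction error plus KL regularizer for a single cell.

    Squared error is an assumption made for this sketch; other
    reconstruction terms (e.g. cross-entropy) are equally common.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, log_var)


# Hypothetical encoder output for one cell (3-dimensional latent code).
rng = random.Random(0)
mu = [0.5, -1.0, 0.0]
log_var = [0.0, -2.0, 0.1]
z = sample_latent(mu, log_var, rng)  # one stochastic latent sample
```

In a trained model, `mu` and `log_var` would come from an encoder network applied to a cell's expression or copy-number profile, and the encoded features would typically be what downstream clustering or trajectory analysis operates on.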
