A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data

There are distinguishing features or “hallmarks” of cancer that are found across tumors, individuals, and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet within a single tumor there is often extensive genetic heterogeneity, as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important for the personalized design of targeted therapeutic combinations and for monitoring possible recurrence after treatment. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy, and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analysis of a real acute lymphoblastic leukemia sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the results of our model on this real dataset reveals co-mutated loci across samples.
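
The Gamma-Poisson representation mentioned above rests on a standard identity: normalizing independent gamma variables gives Dirichlet-distributed weights, and independent Poisson counts with those gamma rates are, conditional on their total, multinomial with those weights. The sketch below illustrates only that identity for a finite number of components. It is not the authors' model or sampler, and the function names, the scale parameter, and the example concentrations are all assumptions made for illustration.

```python
import math
import random


def normalized_gamma_weights(concentrations, rng):
    """Draw Dirichlet(concentrations) weights by normalizing Gamma(a_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [g / total for g in draws]


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def gamma_poisson_counts(concentrations, scale, rng):
    """Hypothetical helper: gamma rates per component, then Poisson counts.

    rate_k ~ Gamma(shape=a_k, scale=scale); count_k ~ Poisson(rate_k).
    Conditional on sum(counts), the counts are multinomial with
    probabilities rate_k / sum(rates), which are Dirichlet(a) distributed.
    """
    rates = [rng.gammavariate(a, scale) for a in concentrations]
    counts = [sample_poisson(r, rng) for r in rates]
    return rates, counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    alphas = [1.0, 2.0, 3.0]  # illustrative concentrations, not from the paper
    print(normalized_gamma_weights(alphas, rng))
    print(gamma_poisson_counts(alphas, 10.0, rng))
```

Working with unnormalized gamma rates and Poisson counts, rather than with normalized weights directly, is the kind of representation that typically makes conditional updates in a Gibbs sampler conjugate; the specific augment-and-marginalize steps of the paper are not reproduced here.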
