Accurately Estimating Tumor Purity of Samples with High Degree of Heterogeneity from Cancer Sequencing Data

Tumor purity is the proportion of tumor cells in the sampled admixture. Estimating tumor purity is a key step both for understanding the tumor micro-environment and for reducing false positives and false negatives in genomic analysis. However, existing approaches often lose accuracy when analyzing samples with a high degree of heterogeneity, because the patterns of clonal architecture present in the sequencing data interfere with the signals that purity estimation algorithms expect. In this article, we propose a computational method, EMPurity, which accurately infers the tumor purity of samples with a high degree of heterogeneity. EMPurity captures the patterns of both tumor purity and clonal structure with a probabilistic model. The model parameters are calculated directly from aligned reads, which prevents errors in variant calling results from propagating into the estimate. We test EMPurity on a series of datasets against three popular approaches, and it outperforms them across different simulation configurations.
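
To illustrate why clonal architecture can mislead a purity estimate, the following minimal Python sketch may help. It is not EMPurity's model; it assumes a deliberately simplified setting (copy-neutral diploid regions, heterozygous somatic SNVs, no sequencing noise), in which a mutation carried by a fraction `cellular_fraction` of tumor cells has an expected variant allele frequency of `purity * cellular_fraction / 2`. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_vaf(purity, cellular_fraction=1.0):
    """Expected variant allele frequency of a heterozygous somatic SNV.

    Assumes a copy-neutral diploid region (an illustrative simplification).
    cellular_fraction is the share of tumor cells carrying the mutation:
    1.0 for a clonal mutation, less than 1.0 for a subclonal one.
    """
    return purity * cellular_fraction / 2.0


def naive_purity(alt_counts, total_counts):
    """Naive estimate: twice the pooled VAF, treating every mutation as clonal."""
    return 2.0 * sum(alt_counts) / sum(total_counts)


# Hypothetical sample: true purity 0.8, read depth 100 at every site.
true_purity = 0.8
depth = 100

# Two clonal mutations plus two subclonal mutations found in half of the tumor cells.
fractions = [1.0, 1.0, 0.5, 0.5]
alt = [round(depth * expected_vaf(true_purity, f)) for f in fractions]
total = [depth] * len(fractions)

# Subclonal sites pull the pooled VAF down, so the naive estimate is too low.
estimate = naive_purity(alt, total)
print(f"true purity = {true_purity}, naive estimate = {estimate:.2f}")
```

In this toy setting the subclonal mutations lower the pooled allele frequency, so a method that ignores clonal structure underestimates purity; a model that represents purity and clonal structure jointly is one way to avoid that bias.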
