tHapMix: simulating tumour samples through haplotype mixtures

Motivation: Large-scale rearrangements and copy number changes, combined with different modes of clonal evolution, create extensive somatic genome diversity. This diversity makes it difficult to develop versatile and scalable variant calling tools and to create well-calibrated benchmarks.

Results: We developed tHapMix, a new simulation framework that enables the creation of tumour samples with different ploidy, purity and polyclonality features. It scales easily to the simulation of hundreds of somatic genomes, while the re-use of real read data preserves the noise and biases present in sequencing platforms. We further demonstrate the utility of tHapMix by creating a simulated set of 140 somatic genomes and showing how it can be used to train and test somatic copy number variant calling tools.

Availability and implementation: tHapMix is distributed under an open source license and can be downloaded from https://github.com/Illumina/tHapMix.

Contact: sivakhno@illumina.com

Supplementary information: Supplementary data are available at Bioinformatics online.
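To make the ploidy, purity and polyclonality parameters concrete, here is a minimal Python sketch of how the expected per-haplotype copy number of one genomic segment could be computed for a mixed sample. It is an illustration under stated assumptions, not the tHapMix implementation: the function name `mixed_haplotype_copies`, its arguments and the example values are all hypothetical. Purity is taken to be the fraction of tumour cells in the sample, and clone fractions are taken to sum to 1 within the tumour component.

```python
from typing import Sequence, Tuple


def mixed_haplotype_copies(
    purity: float,
    clone_fractions: Sequence[float],
    clone_copies: Sequence[Tuple[int, int]],
    normal_copies: Tuple[int, int] = (1, 1),
) -> Tuple[float, float]:
    """Expected copy number of each haplotype in one segment of a mixed sample.

    Hypothetical helper for illustration only.
    purity          -- fraction of tumour cells in the sample (0..1)
    clone_fractions -- share of each clone within the tumour (must sum to 1)
    clone_copies    -- (haplotype A, haplotype B) copy numbers for each clone
    normal_copies   -- copy numbers in the contaminating normal cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if len(clone_fractions) != len(clone_copies):
        raise ValueError("one copy-number pair is needed per clone")
    if abs(sum(clone_fractions) - 1.0) > 1e-9:
        raise ValueError("clone fractions must sum to 1")

    mixed = []
    for hap in range(2):
        # Average copy number of this haplotype across tumour clones.
        tumour = sum(f * cn[hap] for f, cn in zip(clone_fractions, clone_copies))
        # Blend the tumour and normal components according to purity.
        mixed.append(purity * tumour + (1.0 - purity) * normal_copies[hap])
    return mixed[0], mixed[1]


# Example: 80% purity, two clones (75% / 25%) with different copy numbers.
hap_a, hap_b = mixed_haplotype_copies(
    purity=0.8,
    clone_fractions=[0.75, 0.25],
    clone_copies=[(2, 1), (1, 0)],
)
print(f"haplotype A: {hap_a:.2f}, haplotype B: {hap_b:.2f}")
```

In a read-mixing simulator, values like these could serve as relative weights when sampling reads from haplotype-specific alignments, which is one plausible way re-used real reads would keep platform noise while reflecting the simulated copy number state.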
