Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring

Background: Somatic copy number alterations (SCNAs) can be used to infer tumor subclonal populations in whole-genome sequencing studies, where the read count ratios between tumor-normal paired samples usually serve as the proxy for inference. Existing SCNA-based subclonal population inference tools assume that the GC bias of the tumor and normal samples has the same form and is therefore fully offset by taking the read count ratio. However, we found that the read count ratio on SCNA segments shows a log-linear bias pattern, which degrades the performance of existing tools that rely on read count ratios. Currently, no correction tool takes this read ratio bias into account.

Results: We present Pre-SCNAClonal, a tool that improves tumor subclonal population inference by correcting GC bias at the SCNA level. Pre-SCNAClonal first corrects the GC bias using a Markov chain Monte Carlo (MCMC) probabilistic model, then accurately locates baseline DNA segments (those not containing any SCNAs) with a hierarchical clustering model. We show that Pre-SCNAClonal outperforms existing GC-bias correction methods at all levels of subclonal population.

Conclusions: Pre-SCNAClonal can be run independently, or serve as a pre-processing/GC-correction step in conjunction with existing SCNA-based subclonal inference tools.
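To make the log-linear bias concrete, the following is a minimal sketch of removing a linear GC trend from log read count ratios. It is not the Pre-SCNAClonal implementation: it swaps the MCMC model for an ordinary least-squares fit, it assumes every segment has the same copy number (so that the fitted trend reflects GC bias only), and the function name `correct_log_linear_gc_bias` and its arguments are hypothetical.

```python
import numpy as np

def correct_log_linear_gc_bias(gc, tumor_reads, normal_reads):
    """Remove a linear GC trend from per-segment log read count ratios.

    Hypothetical helper for illustration only. Assumes all segments share
    one copy number, so the fitted slope is attributed entirely to GC bias.
    """
    gc = np.asarray(gc, dtype=float)
    tumor = np.asarray(tumor_reads, dtype=float)
    normal = np.asarray(normal_reads, dtype=float)

    # Log of the tumor/normal read count ratio for each segment.
    log_ratio = np.log(tumor / normal)

    # Ordinary least-squares fit: log_ratio ~ slope * gc + intercept.
    slope, intercept = np.polyfit(gc, log_ratio, 1)

    # Subtract only the GC-dependent part, centred on the median GC
    # content, so the overall level of the log ratios is preserved.
    corrected = log_ratio - slope * (gc - np.median(gc))
    return corrected, slope
```

With real data the segments carry different copy numbers, so a single global fit like this would mix copy-number levels; that is one reason a step that separates baseline segments from SCNA segments is needed.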
