CNVkit: Copy number detection and visualization for targeted sequencing using off-target reads

Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massively parallel sequencing is increasingly used to infer copy number information from variations in the read depth of sequencing data. However, this approach has limitations in targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number genome-wide at a resolution equivalent to 100 kilobases from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit against copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that it is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for compatibility with other software. CNVkit is freely available from http://github.com/etal/cnvkit.
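
The normalization and GC-correction steps described above can be illustrated with a minimal sketch. This is not CNVkit's code or API: the function names, the pseudocount, and the use of a rolling median over bins sorted by GC fraction are all assumptions chosen for illustration. It shows per-bin read depth converted to median-centered log2 ratios against a pooled reference, followed by subtraction of a GC-dependent trend.

```python
import numpy as np

def log2_ratio(sample_depth, reference_depth, pseudocount=0.5):
    """Median-centered log2 ratio of sample depth to pooled-reference depth.

    The pseudocount (an assumption) avoids taking log2 of zero-depth bins.
    """
    sample = np.asarray(sample_depth, dtype=float) + pseudocount
    reference = np.asarray(reference_depth, dtype=float) + pseudocount
    ratios = np.log2(sample / reference)
    return ratios - np.median(ratios)

def rolling_median(values, window):
    """Rolling median with edge padding so the output matches the input length."""
    half = window // 2
    padded = np.pad(values, half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(values))])

def correct_gc_bias(log2_values, gc_fraction, window=5):
    """Subtract a GC-dependent trend from per-bin log2 ratios.

    Bins are sorted by GC fraction, the trend is estimated with a rolling
    median (a hypothetical choice of smoother), and the trend is mapped
    back to the original bin order before subtraction.
    """
    log2_values = np.asarray(log2_values, dtype=float)
    order = np.argsort(gc_fraction, kind="stable")
    trend_sorted = rolling_median(log2_values[order], window)
    trend = np.empty_like(log2_values)
    trend[order] = trend_sorted
    return log2_values - trend
```

In this sketch, a sample whose log2 ratios vary only as a function of GC fraction is flattened to zero, while deviations that are not explained by GC content remain as candidate copy number signal. The same pattern could in principle be repeated for the other two bias sources named above, using target size or repeat content as the sorting covariate.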
