CNAseg - a novel framework for identification of copy number changes in cancer from second-generation sequencing data

MOTIVATION: Copy number abnormalities (CNAs) represent an important type of genetic mutation that can lead to abnormal cell growth and proliferation. New high-throughput sequencing technologies promise comprehensive characterization of CNAs. In contrast to microarrays, where probe design follows a carefully developed protocol, reads represent a random sample from a library and may be prone to representation biases due to GC content and other factors. The discrimination between true and false positive CNAs becomes an important issue.

RESULTS: We present a novel approach, called CNAseg, to identify CNAs from second-generation sequencing data. It uses depth of coverage to estimate copy number states and flowcell-to-flowcell variability in cancer and normal samples to control the false positive rate. We tested the method using the COLO-829 melanoma cell line sequenced to 40-fold coverage. An extensive simulation scheme was developed to recreate different scenarios of copy number changes and depth of coverage by altering a real dataset with spiked-in CNAs. Comparison to alternative approaches using both real and simulated datasets showed that CNAseg achieves superior precision and improved sensitivity estimates.

AVAILABILITY: The CNAseg package and test data are available at http://www.compbio.group.cam.ac.uk/software.html.
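The depth-of-coverage idea summarized in the abstract can be illustrated with a small sketch. This is not the CNAseg algorithm, whose segmentation and statistical model are not described here. It only assumes that reads have already been counted in fixed-size genomic windows, and it uses the spread between two flowcells of the same sample as a crude noise floor. All function names, the pseudocount, and the toy counts are hypothetical.

```python
import math
from statistics import median


def log2_ratios(sample_counts, reference_counts, pseudocount=0.5):
    """Per-window log2(sample / reference) after median scaling.

    Median scaling (an assumption of this sketch) keeps windows with
    unchanged copy number close to 0 even if sequencing depth differs.
    """
    scale = median(reference_counts) / median(sample_counts)
    return [
        math.log2((s * scale + pseudocount) / (r + pseudocount))
        for s, r in zip(sample_counts, reference_counts)
    ]


def noise_threshold(flowcell_a, flowcell_b, pseudocount=0.5):
    """Largest |log2 ratio| between two flowcells of the SAME sample.

    Any difference between flowcells of one sample is technical noise,
    so it gives a rough bound on what should not be called a CNA.
    """
    return max(abs(r) for r in log2_ratios(flowcell_a, flowcell_b, pseudocount))


def call_windows(tumor_counts, normal_counts, threshold):
    """Label each window as gain, loss, or neutral relative to the threshold."""
    calls = []
    for ratio in log2_ratios(tumor_counts, normal_counts):
        if ratio > threshold:
            calls.append("gain")
        elif ratio < -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Toy window counts (made up for illustration only).
normal_flowcell_1 = [50, 52, 48, 50]
normal_flowcell_2 = [50, 48, 52, 50]
normal_total = [a + b for a, b in zip(normal_flowcell_1, normal_flowcell_2)]
tumor_total = [100, 100, 200, 100]  # third window carries a spiked-in gain

threshold = noise_threshold(normal_flowcell_1, normal_flowcell_2)
calls = call_windows(tumor_total, normal_total, threshold)
print(calls)
```

A single maximum over two flowcells is a deliberately simple noise estimate. A real method would model the count differences statistically and segment adjacent windows before calling.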
