Statistical challenges associated with detecting copy number variations with next-generation sequencing

MOTIVATION Analysing next-generation sequencing (NGS) data for copy number variations (CNVs) detection is a relatively new and challenging field, with no accepted standard protocols or quality control measures so far. There are by now several algorithms developed for each of the four broad methods for CNV detection using NGS, namely the depth of coverage (DOC), read-pair, split-read and assembly-based methods. However, because of the complexity of the genome and the short read lengths from NGS technology, there are still many challenges associated with the analysis of NGS data for CNVs, no matter which method or algorithm is used. RESULTS In this review, we describe and discuss areas of potential biases in CNV detection for each of the four methods. In particular, we focus on issues pertaining to (i) mappability, (ii) GC-content bias, (iii) quality control measures of reads and (iv) difficulty in identifying duplications. To gain insights to some of the issues discussed, we also download real data from the 1000 Genomes Project and analyse its DOC data. We show examples of how reads in repeated regions can affect CNV detection, demonstrate current GC-correction algorithms, investigate sensitivity of DOC algorithm before and after quality control of reads and discuss reasons for which duplications are harder to detect than deletions.

[1]  Kenny Q. Ye,et al.  Mapping copy number variation by population scale genome sequencing , 2010, Nature.

[2]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[3]  J. J. Shen,et al.  Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing , 2012, 1206.6627.

[4]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[5]  S. Salzberg,et al.  Repetitive DNA and next-generation sequencing: computational challenges and solutions , 2012, Nature Reviews Genetics.

[6]  Joshua M. Korn,et al.  Discovery and genotyping of genome structural polymorphism by sequencing on a population scale , 2011, Nature Genetics.

[7]  E. Eichler,et al.  Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. , 2009, Genome research.

[8]  Life Technologies,et al.  A map of human genome variation from population-scale sequencing , 2011 .

[9]  Joseph T. Glessner,et al.  PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. , 2007, Genome research.

[10]  Tomas W. Fitzgerald,et al.  Origins and functional impact of copy number variation in the human genome , 2010, Nature.

[11]  Bjarne Knudsen,et al.  A Computer Simulator for Assessing Different Challenges and Strategies of de Novo Sequence Assembly , 2010, Genes.

[12]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[13]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Eleazar Eskin,et al.  Efficient algorithms for tandem copy number variation reconstruction in repeat-rich regions , 2011, Bioinform..

[15]  John Quackenbush,et al.  Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV , 2011, Bioinform..

[16]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[17]  Ira M. Hall,et al.  Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. , 2010, Genome research.

[18]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[19]  Bradley P. Coe,et al.  Genome structural variation discovery and genotyping , 2011, Nature Reviews Genetics.

[20]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[21]  Tae-Min Kim,et al.  Detecting structural variations in the human genome using next generation sequencing. , 2010, Briefings in functional genomics.

[22]  J. Kitzman,et al.  Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing , 2009, Nature Genetics.

[23]  Kai Ye,et al.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads , 2009, Bioinform..

[24]  Misko Dzamba,et al.  Detecting copy number variation with mated short reads. , 2010, Genome research.

[25]  Francisco M. De La Vega,et al.  Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. , 2009, Genome research.

[26]  M. Gerstein,et al.  CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. , 2011, Genome research.

[27]  Joshua M. Korn,et al.  Integrated detection and population-genetic analysis of SNPs and copy number variation , 2008, Nature Genetics.

[28]  Paul Medvedev,et al.  Computational methods for discovering structural variation with next-generation sequencing , 2009, Nature Methods.

[29]  Chao Xie,et al.  CNV-seq, a new method to detect copy number variation using high-throughput sequencing , 2009, BMC Bioinformatics.

[30]  A. Kasarskis,et al.  A window into third-generation sequencing. , 2010, Human molecular genetics.

[31]  Lars Feuk,et al.  Inversion variants in the human genome: role in disease and genome architecture , 2010, Genome Medicine.

[32]  Derek Y. Chiang,et al.  High-resolution mapping of copy-number alterations with massively parallel sequencing , 2009, Nature Methods.

[33]  R. Wilson,et al.  BreakDancer: An algorithm for high resolution mapping of genomic structural variation , 2009, Nature Methods.

[34]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[35]  Timothy B. Stockwell,et al.  Evaluation of next generation sequencing platforms for population targeted sequencing studies , 2009, Genome Biology.

[36]  Y. Benjamini,et al.  Summarizing and correcting the GC content bias in high-throughput sequencing , 2012, Nucleic acids research.

[37]  John Wei,et al.  Towards a comprehensive structural variation map of an individual human genome , 2010, Genome Biology.

[38]  Derek Y. Chiang,et al.  Mapping duplicated sequences , 2009, Nature Biotechnology.

[39]  Emmanuel Barillot,et al.  SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data , 2010, Bioinform..

[40]  Süleyman Cenk Sahinalp,et al.  Combinatorial Algorithms for Structural Variation Detection in High Throughput Sequenced Genomes , 2009, RECOMB.

[41]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[42]  Peter H. Sudmant,et al.  Diversity of Human Copy Number Variation and Multicopy Genes , 2010, Science.

[43]  Adam M. Phillippy,et al.  Comparative genome assembly , 2004, Briefings Bioinform..

[44]  Kenny Q. Ye,et al.  Sensitive and accurate detection of copy number variants using read depth of coverage. , 2009, Genome research.

[45]  Tatiana Popova,et al.  Supplementary Methods , 2012, Acta Neuropsychiatrica.

[46]  Juliane C. Dohm,et al.  Substantial biases in ultra-short read data sets from high-throughput DNA sequencing , 2008, Nucleic acids research.

[47]  T. Fennell,et al.  Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries , 2011, Genome Biology.

[48]  M. Gerstein,et al.  PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data , 2009, Genome Biology.

[49]  M. Schatz,et al.  Assembly of large genomes using second-generation sequencing. , 2010, Genome research.