Family-Based Benchmarking of Copy Number Variation Detection Software

The analysis of structural variants, in particular copy number variations (CNVs), has proven valuable in unraveling the genetic basis of human diseases. Consequently, a large number of algorithms have been developed to detect CNVs in SNP array signal intensity data. Using the European and African HapMap trio data, we compared six commonly used CNV detection software tools, namely Affymetrix Power Tools (APT), QuantiSNP, PennCNV, GLAD, R-gada and VEGA, and assessed their level of pairwise prediction concordance. The tool-specific CNV prediction accuracy was assessed in silico by way of intra-familial validation. The software tools differed greatly in the number and length of the CNVs predicted as well as in the number of markers included in a CNV. All software tools predicted substantially more deletions than duplications. Intra-familial validation revealed consistently low levels of prediction accuracy, as measured by the proportion of validated CNVs (34-60%). Moreover, up to 20% of apparent family-based validations were found to be due to chance alone. Software based on hidden Markov models (HMMs) tended to predict fewer CNVs than segmentation-based algorithms, albeit with greater validity. PennCNV yielded the highest prediction accuracy (60.9%). Finally, the pairwise concordance of CNV predictions varied widely depending on the software tools involved. We recommend HMM-based software, in particular PennCNV, rather than segmentation-based algorithms when validity is the primary concern of CNV detection. QuantiSNP may be used as an additional tool to detect sets of CNVs not detectable by the other tools. Our study also reemphasizes the need for laboratory-based validation, such as qPCR, of CNVs predicted in silico.
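The abstract does not spell out the validation criterion, so the following is only a minimal sketch of how intra-familial validation could be computed. It assumes that a CNV predicted in a child counts as validated when a parent carries a predicted CNV of the same type on the same chromosome covering at least half of the child's CNV. The `CNV` record, the function names and the 50% overlap threshold are hypothetical choices made for illustration, not the criteria used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    chrom: str
    start: int  # inclusive start position
    end: int    # inclusive end position
    kind: str   # "del" or "dup"


def overlap_fraction(a: CNV, b: CNV) -> float:
    """Fraction of a's length that is covered by b (0.0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return max(shared, 0) / (a.end - a.start + 1)


def validated_fraction(calls, reference_calls, min_frac=0.5):
    """Proportion of calls matched by a same-type reference call.

    min_frac is an assumed threshold, not a value taken from the study.
    """
    if not calls:
        return 0.0
    validated = sum(
        any(c.kind == r.kind and overlap_fraction(c, r) >= min_frac
            for r in reference_calls)
        for c in calls
    )
    return validated / len(calls)


# Toy data: one child deletion is supported by a parental deletion,
# while the child duplication only coincides with a parental deletion.
child = [CNV("chr1", 100, 199, "del"), CNV("chr2", 500, 599, "dup")]
parents = [CNV("chr1", 150, 300, "del"), CNV("chr2", 500, 599, "del")]
rate = validated_fraction(child, parents)
```

Under the same assumptions, the function could also serve as a rough pairwise concordance measure by passing the calls of one tool as `calls` and those of another tool, for the same sample, as `reference_calls`. Note that the measure is not symmetric in its two arguments.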
