A machine-learning approach for accurate detection of copy-number variants from exome sequencing

Copy-number variants (CNVs) are a major cause of several genetic disorders, making their detection an essential component of genetic analysis pipelines. Current methods for detecting CNVs from exome sequencing data are limited by high false positive rates and low concordance due to the inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn-diagram approaches to identify “high-confidence” CNVs. However, this approach is inadequate, as it misses potentially true calls that lack consensus among multiple callers. Here, we present CN-Learn, a machine-learning framework (https://github.com/girirajanlab/CN_Learn) that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs. Using CNVs predicted by four exome-based CNV callers (CANOES, CODEX, XHMM and CLAMMS) from 503 samples, we demonstrate that CN-Learn identifies true CNVs at higher precision (∼90%) and recall (∼85%) while maintaining robust performance even when trained with minimal data (∼30 samples). CN-Learn recovers twice as many CNVs as individual callers or Venn diagram-based approaches, with features such as exome capture probe count, caller concordance and GC content providing the most discriminatory power. In fact, about 58% of all true CNVs recovered by CN-Learn were either singletons or calls that lacked support from at least one caller. Our study underscores the limitations of current approaches for CNV identification and provides an effective method that yields high-quality CNVs for application in clinical diagnostics.
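
The limitation of Venn-diagram intersection described above can be illustrated with a minimal sketch. This is not the CN-Learn implementation: the `Call` record, the helper names and the five toy calls are all hypothetical and invented for illustration. It only shows how a caller-concordance feature can be computed, and how a consensus filter discards a validated singleton and so lowers recall.

```python
from dataclasses import dataclass

# The four exome-based callers named in the abstract.
CALLERS = ("CANOES", "CODEX", "XHMM", "CLAMMS")


@dataclass(frozen=True)
class Call:
    """One candidate CNV region (hypothetical record layout)."""
    region: str           # illustrative label, not a real coordinate
    callers: frozenset    # callers that reported this region
    validated: bool       # truth label from an orthogonal validation


def concordance(call):
    """Number of callers supporting the call (1 means singleton)."""
    return len(call.callers)


def venn_filter(calls, min_callers=2):
    """Keep only calls supported by at least `min_callers` callers."""
    return [c for c in calls if concordance(c) >= min_callers]


def precision_recall(selected, all_calls):
    """Precision and recall of `selected` against the validated calls."""
    true_total = sum(c.validated for c in all_calls)
    true_selected = sum(c.validated for c in selected)
    precision = true_selected / len(selected) if selected else 0.0
    recall = true_selected / true_total if true_total else 0.0
    return precision, recall


# Toy data, invented for illustration only.
calls = [
    Call("r1", frozenset(CALLERS), True),
    Call("r2", frozenset({"CANOES", "XHMM"}), True),
    Call("r3", frozenset({"CODEX"}), True),            # validated singleton
    Call("r4", frozenset({"CLAMMS"}), False),
    Call("r5", frozenset({"XHMM", "CLAMMS"}), False),  # concordant but false
]

consensus = venn_filter(calls, min_callers=2)
precision, recall = precision_recall(consensus, calls)
print([c.region for c in consensus], precision, recall)
```

In this toy set the consensus filter drops the validated singleton `r3` while keeping the false concordant call `r5`. That is the kind of case where a classifier trained on several features, rather than a fixed concordance threshold, could in principle do better.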
