dpGMM: A Dirichlet Process Gaussian Mixture Model for Copy Number Variation Detection in Low-Coverage Whole-Genome Sequencing Data

Comprehensive identification and cataloging of copy number variations (CNVs) are essential to providing a complete view of human genetic variation and to finding disease genes. Because large-scale whole-genome sequencing (WGS) must be carried out under cost constraints, low-coverage data is an attractive basis for CNV identification. However, such low-coverage data is sensitive to noise and sequencing biases, which has resulted in low-resolution CNV detection in past experimental designs for WGS datasets. In this paper, we present a control-free approach based on a Dirichlet process Gaussian mixture model (dpGMM) that analyzes the read depth (RD) of low-coverage WGS datasets for CNV discovery. First, noise and biases in the RD signals are corrected in the preprocessing step of dpGMM. We then assume that RD signals across genomic regions follow a Gaussian mixture model (GMM) in which each Gaussian component corresponds to a copy number state. Rather than requiring the number of Gaussian components to be specified in advance, dpGMM builds a Dirichlet process (DP) GMM for the RD signals and uses the DP prior to infer the number of components. We apply dpGMM to simulated datasets with different coverages and to individual real datasets, and compare it with three widely used RD-based pipelines: CNVnator, GROM-RD, and BIC-seq2. Simulation results demonstrate that dpGMM achieves a high F1 score on both low- and high-coverage sequences. In addition, on real data, the number of CNVs detected by dpGMM that overlap the standard benchmark is twice that of other tools such as CNVnator and GROM-RD.
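
The idea of letting a DP prior decide how many Gaussian components the RD signal needs can be illustrated with a small sketch. This is not the inference procedure used by dpGMM, which the abstract does not specify. It swaps in a collapsed Gibbs sampler over the Chinese restaurant process representation of a DP mixture, and it assumes a known, fixed within-component standard deviation (`sigma`). The function name `dp_gmm_gibbs`, the concentration parameter `alpha`, the prior settings, and the synthetic RD values (about 100 for normal bins, 50 for a deletion, 150 for a duplication) are all hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def dp_gmm_gibbs(x, alpha=1.0, sigma=5.0, n_sweeps=20, seed=0):
    """Collapsed Gibbs sampling for a 1-D DP Gaussian mixture.

    Assumptions (illustrative only): known component variance sigma**2 and a
    normal prior on component means centred on the data mean, with variance
    equal to the data variance.
    """
    rng = random.Random(seed)
    mu0 = sum(x) / len(x)
    tau0_sq = sum((v - mu0) ** 2 for v in x) / len(x)
    sigma_sq = sigma ** 2

    labels = [0] * len(x)      # start with every bin in one component
    counts = {0: len(x)}       # bins per component
    sums = {0: sum(x)}         # sum of RD values per component
    next_id = 1

    for _ in range(n_sweeps):
        for i, xi in enumerate(x):
            # Remove bin i from its current component.
            k_old = labels[i]
            counts[k_old] -= 1
            sums[k_old] -= xi
            if counts[k_old] == 0:
                del counts[k_old], sums[k_old]

            # Weight of each existing component: size * predictive density.
            candidates, weights = [], []
            for k, n_k in counts.items():
                post_prec = 1 / tau0_sq + n_k / sigma_sq
                post_mean = (mu0 / tau0_sq + sums[k] / sigma_sq) / post_prec
                candidates.append(k)
                weights.append(
                    n_k * normal_pdf(xi, post_mean, sigma_sq + 1 / post_prec)
                )
            # Weight of opening a new component: alpha * prior predictive.
            candidates.append(next_id)
            weights.append(alpha * normal_pdf(xi, mu0, sigma_sq + tau0_sq))

            k_new = rng.choices(candidates, weights=weights)[0]
            if k_new == next_id:
                next_id += 1
            labels[i] = k_new
            counts[k_new] = counts.get(k_new, 0) + 1
            sums[k_new] = sums.get(k_new, 0.0) + xi

    means = {k: sums[k] / counts[k] for k in counts}
    return labels, means


# Synthetic RD signal: normal bins, a deletion, normal bins, a duplication.
data_rng = random.Random(1)
true_rd = [100.0] * 40 + [50.0] * 20 + [100.0] * 40 + [150.0] * 20
rd = [data_rng.gauss(m, 5.0) for m in true_rd]

labels, means = dp_gmm_gibbs(rd)
for k in sorted(means, key=means.get):
    print(f"component {k}: mean RD {means[k]:.1f}, bins {labels.count(k)}")
```

The number of components is never passed in: new components are opened only when a bin is poorly explained by the existing ones. In a real pipeline the components would then be mapped to copy number states, for example by comparing each component mean with the mean of the dominant (assumed diploid) component.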
