A sequential Monte Carlo framework for haplotype inference in CNV/SNP genotype data

Copy number variations (CNVs) are abundant in the human genome. They have been associated with complex traits in genome-wide association studies (GWAS) and are expected to continue playing an important role in identifying the etiology of disease phenotypes. Current high-throughput whole-genome single-nucleotide polymorphism (SNP) arrays yield datasets that contain both integer copy numbers in CNV regions and SNP genotypes. Haplotypes have been shown to offer advantages over genotypes in identifying disease traits, but although they are available for SNP genotypes, they are largely unavailable for CNV/SNP data because of insufficient computational tools. We introduce a new framework for inferring haplotypes in CNV/SNP data using a sequential Monte Carlo sampling scheme, ‘Tree-Based Deterministic Sampling CNV’ (TDSCNV). We compare our method with polyHap (v2.0), the only currently available software able to perform inference in CNV/SNP genotypes, on datasets with varying numbers of markers. Both algorithms show similar accuracy, but TDSCNV is an order of magnitude faster and scales linearly with the number of markers and the number of individuals; it could therefore be the method of choice for haplotype inference in such datasets. Our method is implemented in the TDSCNV package, which is available for download at http://www.ee.columbia.edu/~anastas/tdscnv.
