A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes

Background: Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles, and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data.

Methods: A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that does not depend on the family structure of a dataset or on the presence of pedigree information.

Results: The algorithm performed well in both simulated and real livestock and human datasets in terms of both phasing accuracy and computational efficiency. The percentage of alleles that could be phased in both simulated and real datasets of varying size generally exceeded 98%, while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. The accuracy of phasing was affected by dataset size, with lower accuracy for dataset sizes below 1000, but was not affected by effective population size, family data structure, presence or absence of pedigree information, or SNP density. The method was computationally fast. In comparison to a commonly used statistical method (fastPHASE), the current method made about 8% fewer phasing mistakes and ran about 26 times faster for a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available.

Conclusions: The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets.
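The two summary statistics reported in the results (percentage of alleles phased and percentage of alleles incorrectly phased) can be illustrated with a minimal sketch. This is not the authors' software. It assumes haplotypes are stored as arrays of 0/1 alleles, that an unphased allele is coded 9, that the inferred haplotypes are already aligned with the true (simulated) ones, and that the error percentage is taken relative to the alleles that were actually phased. The function name `phasing_summary` and the `MISSING` code are hypothetical.

```python
import numpy as np

MISSING = 9  # assumed code for an allele the algorithm left unphased


def phasing_summary(true_haps, inferred_haps):
    """Return (percent of alleles phased, percent of phased alleles that are wrong).

    Both inputs are arrays of the same shape (haplotypes x SNPs) holding
    alleles coded 0/1; inferred_haps may also contain MISSING.
    """
    true_haps = np.asarray(true_haps)
    inferred_haps = np.asarray(inferred_haps)

    # An allele counts as phased when the algorithm assigned it a value.
    phased = inferred_haps != MISSING
    n_phased = int(phased.sum())
    pct_phased = 100.0 * n_phased / inferred_haps.size

    # An allele is incorrectly phased when it was assigned but differs from the truth.
    wrong = phased & (inferred_haps != true_haps)
    pct_wrong = 100.0 * int(wrong.sum()) / n_phased if n_phased else 0.0
    return pct_phased, pct_wrong


# Toy example: two haplotypes of four SNPs, one allele unphased, one wrong.
true_haps = [[0, 1, 1, 0],
             [1, 1, 0, 0]]
inferred_haps = [[0, 1, 9, 0],
                 [1, 0, 0, 0]]
pct_phased, pct_wrong = phasing_summary(true_haps, inferred_haps)
```

In this toy case 7 of 8 alleles are phased and 1 of those 7 is wrong. With real data the true phase is unknown, so only the first statistic can be computed directly, which is consistent with the error rate being reported for simulated data only.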
