Extending long-range phasing and haplotype library imputation algorithms to very large and heterogeneous datasets

Background This paper describes the latest improvements to the long-range phasing and haplotype library imputation algorithms, which enable them to successfully phase both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of long-range phasing could not phase large datasets because of the computational cost of defining surrogate parents by exhaustive all-against-all searches. Further, neither long-range phasing nor haplotype library imputation was designed to deal with large amounts of missing data, which are inherent when using multiple SNP arrays.
Methods Here, we developed methods that avoid the need for all-against-all searches by performing long-range phasing on subsets of individuals and then combining the results. We also extended the long-range phasing and haplotype library imputation algorithms to enable them to use different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of our phasing software AlphaPhase.
Results A simulated dataset with one million individuals genotyped with the same set of 6,711 SNPs for a single chromosome took two days to phase. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took 14 days to phase. The percentage of correctly phased alleles at heterozygous loci was 90.5% and 90.0%, respectively, for the two datasets, which is comparable to the accuracy achieved with previous versions of AlphaPhase on smaller datasets. The phasing accuracy for datasets with different sets of markers was generally lower than that for datasets with one set of markers. For a simulated dataset with three sets of markers, 2.8% of alleles at heterozygous positions were phased incorrectly, whereas the equivalent figure with one set of markers was 0.6%.
Conclusions The improved long-range phasing and haplotype library imputation algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. This will enable more powerful breeding and genetics research and application.
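The two ideas named in the Methods (searching for surrogate parents within subsets rather than all-against-all, and tolerating missing genotypes) can be illustrated with a minimal Python sketch. It is not the AlphaPhase implementation. It assumes genotypes coded 0/1/2 with 9 for missing, uses the standard opposing-homozygote criterion for surrogate parents, and covers only the within-subset search, not the step that combines results across subsets. The function names and the parameters `max_opposing`, `min_shared` and `subset_size` are hypothetical.

```python
MISSING = 9  # assumed missing-genotype code for this sketch


def opposing_homozygotes(g1, g2):
    """Return (number of opposing homozygotes, number of jointly observed loci).

    Loci that are missing in either individual are skipped, so individuals
    genotyped on different marker sets are compared only on shared markers.
    """
    n_opposing = 0
    n_observed = 0
    for a, b in zip(g1, g2):
        if a == MISSING or b == MISSING:
            continue
        n_observed += 1
        if abs(a - b) == 2:  # one individual is 0, the other is 2
            n_opposing += 1
    return n_opposing, n_observed


def is_surrogate(g1, g2, max_opposing=0, min_shared=1):
    """Treat two individuals as surrogates if they share enough observed loci
    and show at most max_opposing opposing homozygotes (hypothetical thresholds)."""
    n_opposing, n_observed = opposing_homozygotes(g1, g2)
    return n_observed >= min_shared and n_opposing <= max_opposing


def surrogates_within_subsets(genotypes, subset_size, **thresholds):
    """Search for surrogates only inside consecutive subsets of individuals,
    so the number of comparisons grows with subset_size rather than with the
    square of the total number of individuals."""
    n = len(genotypes)
    surrogates = {i: [] for i in range(n)}
    for start in range(0, n, subset_size):
        stop = min(start + subset_size, n)
        for i in range(start, stop):
            for j in range(i + 1, stop):
                if is_surrogate(genotypes[i], genotypes[j], **thresholds):
                    surrogates[i].append(j)
                    surrogates[j].append(i)
    return surrogates
```

With `subset_size` fixed, the number of pairwise comparisons scales linearly with the number of individuals, which is the motivation for avoiding the exhaustive search described in the Background.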
