Evaluation of consensus strategies for haplotype phasing.

Haplotype phasing is a critical step for many genetic applications, but incorrect estimates of phase can negatively impact downstream analyses. One proposed strategy to improve phasing accuracy is to combine multiple independent phasing estimates to overcome the limitations of any individual estimate. However, this strategy has yet to be thoroughly explored. This study provides a comprehensive evaluation of consensus strategies for haplotype phasing. We examine the performance of different consensus paradigms and the effect of specific constituent tools across several datasets with different characteristics, as well as their impact on the downstream task of genotype imputation. Based on the outputs of existing phasing tools, we explore two strategies for constructing haplotype consensus estimators: voting across the outputs of multiple phasing tools, and voting across multiple outputs of a single non-deterministic tool. We find that the consensus approach from multiple tools reduces switch error (SE) by an average of 10% compared to any constituent tool when applied to European populations, and has the highest accuracy regardless of population ethnicity, sample size, variant density or variant frequency. Furthermore, the consensus estimator improves the accuracy of the downstream task of genotype imputation carried out by the widely used Minimac3, pbwt and BEAGLE5 tools. Our results provide guidance on how to produce the most accurate phasing estimates and on the trade-offs that a consensus approach may have. Our implementation of consensus haplotype phasing, consHap, is freely available at https://github.com/ziadbkh/consHap. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
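To make the voting idea concrete, the following is a minimal sketch of one way a consensus across several phasing estimates could be formed and scored by switch error. It is an illustration under stated assumptions, not the consHap implementation: the function names (`switch_flags`, `consensus_phase`, `switch_error_rate`), the input encoding (one 0/1 allele per heterozygous site for haplotype 1 of a single sample) and the tie-breaking rule are all hypothetical choices made for this example. Votes are taken on whether the phase switches between consecutive heterozygous sites, so that tools which label the two haplotypes in opposite order still agree.

```python
from typing import List, Sequence


def switch_flags(hap: Sequence[int]) -> List[int]:
    """Return 1 wherever the haplotype-1 allele changes between
    consecutive heterozygous sites, else 0."""
    return [int(hap[i] != hap[i - 1]) for i in range(1, len(hap))]


def consensus_phase(estimates: Sequence[Sequence[int]]) -> List[int]:
    """Majority vote over the switch flags of several phasing estimates.

    Assumption: each estimate lists the haplotype-1 allele (0 or 1) at the
    same heterozygous sites of one sample. Ties (possible with an even
    number of estimates) are resolved as "no switch" in this sketch.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    n_sites = len(estimates[0])
    if any(len(e) != n_sites for e in estimates):
        raise ValueError("all estimates must cover the same sites")
    if n_sites == 0:
        return []

    flags = [switch_flags(e) for e in estimates]
    # The orientation of the consensus is arbitrary: start haplotype 1 at 0.
    consensus = [0]
    for i in range(n_sites - 1):
        votes = sum(f[i] for f in flags)
        switch = int(votes * 2 > len(estimates))
        consensus.append(consensus[-1] ^ switch)
    return consensus


def switch_error_rate(estimate: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of consecutive heterozygous site pairs whose relative
    phase differs between the estimate and the truth."""
    est, tru = switch_flags(estimate), switch_flags(truth)
    if len(est) != len(tru):
        raise ValueError("estimate and truth must cover the same sites")
    if not tru:
        return 0.0
    return sum(e != t for e, t in zip(est, tru)) / len(tru)


if __name__ == "__main__":
    # Toy data (made up for illustration): each tool makes one switch
    # error at a different position; tool_b also labels haplotypes in
    # the opposite order.
    truth = [0, 1, 1, 0, 1, 0]
    tool_a = [0, 1, 0, 1, 0, 1]
    tool_b = [1, 0, 0, 1, 1, 0]
    tool_c = [0, 0, 0, 1, 0, 1]
    combined = consensus_phase([tool_a, tool_b, tool_c])
    for name, est in [("tool_a", tool_a), ("tool_b", tool_b),
                      ("tool_c", tool_c), ("consensus", combined)]:
        print(name, switch_error_rate(est, truth))
```

In this toy case the individual errors fall at different positions, so the vote recovers the true phase. When tools make the same error at the same position, voting cannot correct it, which is one reason the choice of constituent tools matters.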
