Exploring effective approaches for haplotype block phasing

Background: Knowledge of phase, the specific allele sequence on each copy of homologous chromosomes, is increasingly recognized as critical for detecting certain classes of disease-associated mutations. One approach for detecting such mutations is phased haplotype association analysis. While the accuracy of methods for phasing genotype data has been widely explored, little attention has been given to phasing accuracy at haplotype block scale. Understanding how the accuracy of the phasing tool and the method used to determine haplotype blocks jointly affect the error rate within the resulting blocks is essential for conducting accurate haplotype analyses.

Results: We present a systematic study exploring the relationship between seven widely used phasing methods and two common methods for determining haplotype blocks. The evaluation focuses on the number of haplotype blocks that are incorrectly phased. Insights from these results are used to develop a haplotype estimator based on a consensus of three tools. The consensus estimator achieved the most accurate phasing in all applied tests. Individually, EAGLE2, BEAGLE and SHAPEIT2 alternate in being the best performing tool in different scenarios. Determining haplotype blocks based on linkage disequilibrium leads to more correctly phased blocks than a sliding window approach. We find that there is little difference between phasing sections of a genome (e.g. a gene) and phasing entire chromosomes. Finally, we show that the location of phasing errors varies when the tools are applied to the same data several times, a finding that could be important for downstream analyses.

Conclusions: The choice of phasing and block determination algorithms, and their interaction, impacts the accuracy of phased haplotype blocks. This work provides guidance and evidence for the different design choices needed for analyses using haplotype blocks. The study highlights a number of issues that may have limited the replicability of previous haplotype analyses.
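
To make the two central ideas of the abstract concrete, the following is a minimal sketch of (a) a consensus of three phasing calls and (b) a block-level correctness check. It is an illustration under stated assumptions, not the estimator or evaluation code used in the study: the function names `consensus_phase` and `block_is_correct` are hypothetical, alleles are assumed biallelic and coded 0/1, all tools are assumed to agree on the genotypes, and the consensus is taken as a majority vote on the phase relation between consecutive heterozygous sites.

```python
from collections import Counter

def consensus_phase(calls):
    """Majority-vote consensus over phased calls from several tools.

    calls: list of (hap1, hap2) pairs, one per tool, each a list of 0/1 alleles.
    Assumption: every tool reports the same genotypes, so only the phase differs.
    """
    first_h1, first_h2 = calls[0]
    het = [i for i, (a, b) in enumerate(zip(first_h1, first_h2)) if a != b]
    h1, h2 = list(first_h1), list(first_h2)  # homozygous sites are copied as-is
    if not het:
        return h1, h2
    # Haplotype labels are arbitrary, so fix the orientation at the first het site.
    h1[het[0]], h2[het[0]] = 0, 1
    for prev, cur in zip(het, het[1:]):
        # Each tool votes on whether the allele on its first haplotype changes
        # between two consecutive heterozygous sites (1) or stays the same (0).
        votes = Counter(t1[prev] ^ t1[cur] for t1, _ in calls)
        change = votes.most_common(1)[0][0]
        h1[cur] = h1[prev] ^ change
        h2[cur] = 1 - h1[cur]
    return h1, h2

def block_is_correct(est, truth, start, end):
    """True if the estimated pair matches the truth within [start, end),
    allowing the two haplotypes to be swapped."""
    e = (tuple(est[0][start:end]), tuple(est[1][start:end]))
    t = (tuple(truth[0][start:end]), tuple(truth[1][start:end]))
    return e == t or e == (t[1], t[0])

# Toy data: five sites, site 2 is homozygous.
tool_a = ([0, 1, 1, 0, 1], [1, 0, 1, 1, 0])
tool_b = ([1, 0, 1, 1, 0], [0, 1, 1, 0, 1])  # same phasing as tool_a, labels swapped
tool_c = ([0, 1, 1, 1, 0], [1, 0, 1, 0, 1])  # switch error after site 1

consensus = consensus_phase([tool_a, tool_b, tool_c])
```

In this toy example the consensus recovers the phasing of `tool_a`, because two of the three tools agree on every adjacent relation. The block check also shows why block boundaries matter: the switch error in `tool_c` leaves the blocks `[0, 2)` and `[2, 5)` correctly phased, while a block that spans the switch point, such as `[1, 4)`, counts as incorrectly phased.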
