HIBAG—HLA genotype imputation with attribute bagging

Genotyping of classical human leukocyte antigen (HLA) alleles is an essential tool in the analysis of diseases and adverse drug reactions with associations mapping to the major histocompatibility complex (MHC). However, deriving high-resolution HLA types subsequent to whole-genome single-nucleotide polymorphism (SNP) typing or sequencing is often cost-prohibitive for large samples. An alternative approach takes advantage of the extended haplotype structure within the MHC to predict HLA alleles using dense SNP genotypes, such as those available from genome-wide SNP panels. Current methods for HLA imputation are difficult to apply or may require the user to have access to large training data sets with SNP and HLA types. We propose HIBAG (HLA Imputation using attribute BAGging), a method that makes predictions by averaging HLA-type posterior probabilities over an ensemble of classifiers built on bootstrap samples. We assess the performance of HIBAG using our study data (n=2668 subjects of European ancestry) as a training set and HLA data from the British 1958 birth cohort study (n≈1000 subjects) as independent validation samples. Prediction accuracies for HLA-A, B, C, DRB1 and DQB1 range from 92.2% to 98.1% using a set of SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. HIBAG performed well compared with two other leading methods, HLA*IMP and BEAGLE. The method is implemented in a freely available HIBAG R package that includes pre-fit classifiers for European, Asian, Hispanic and African ancestries, providing a readily available imputation approach without the need for access to large training data sets.
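
The ensemble idea described above (bootstrap samples, random SNP subsets, averaged posterior probabilities) can be illustrated with a minimal Python sketch. This is not the HIBAG implementation and does not use the HIBAG R package API: the per-classifier model here is a simple categorical naive Bayes, swapped in purely for illustration, and all function names, parameters and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_naive_bayes(X, y, features, classes):
    """Fit a tiny categorical naive Bayes model on the chosen SNP columns.
    ASSUMPTION: a stand-in for the individual classifier, not HIBAG's model."""
    counts = defaultdict(int)
    for row, label in zip(X, y):
        for j in features:
            counts[(label, j, row[j])] += 1
    return {"features": features, "classes": classes,
            "n_class": Counter(y), "counts": counts, "n": len(y)}

def posterior(model, row):
    """Posterior probability of each class for one SNP genotype vector."""
    scores = {}
    for c in model["classes"]:
        nc = model["n_class"].get(c, 0)
        # Laplace-smoothed class prior.
        p = (nc + 1) / (model["n"] + len(model["classes"]))
        for j in model["features"]:
            # Laplace smoothing over the three genotype codes 0, 1, 2.
            p *= (model["counts"].get((c, j, row[j]), 0) + 1) / (nc + 3)
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def fit_ensemble(X, y, n_classifiers=25, n_features=2, seed=0):
    """Attribute bagging: each classifier is trained on a bootstrap sample
    of subjects and a random subset of SNP columns."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    n = len(y)
    models = []
    for _ in range(n_classifiers):
        idx = [rng.randrange(n) for _ in range(n)]           # bootstrap rows
        feats = rng.sample(range(len(X[0])), n_features)      # random SNPs
        models.append(fit_naive_bayes([X[i] for i in idx],
                                      [y[i] for i in idx], feats, classes))
    return models

def predict_proba(models, row):
    """Average the posterior probabilities over all ensemble members."""
    avg = {c: 0.0 for c in models[0]["classes"]}
    for m in models:
        for c, p in posterior(m, row).items():
            avg[c] += p / len(models)
    return avg

# Toy data (hypothetical): SNP genotypes coded 0/1/2, labels are HLA-A types.
X = [[0, 0, 0, 0]] * 10 + [[2, 2, 2, 2]] * 10
y = ["A*01:01"] * 10 + ["A*02:01"] * 10
models = fit_ensemble(X, y)
probs = predict_proba(models, [0, 0, 0, 0])
best = max(probs, key=probs.get)
print(best, probs)
```

Averaging probabilities, rather than taking a majority vote, keeps a per-call confidence value that can be used to filter uncertain predictions.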
