High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs

Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing would therefore be highly desirable, as it would enable the immunogenetic characterization of samples in ongoing population sequencing projects. However, extensive hyperpolymorphism and sequence similarity between the HLA genes complicate accurate read mapping and make HLA type inference from whole-genome sequencing data challenging. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, at each locus we infer the most likely pair of underlying alleles, at G group resolution, from the IMGT/HLA database, employing a simple likelihood framework. We show that our algorithm, HLA*PRG, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) in a set of 14 samples (3 samples with 2 × 100 bp and 11 samples with 2 × 250 bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 (99.4%). We also identify and re-type two erroneous alleles in the original validation data.
We conclude that HLA*PRG is the first method to achieve, from standard whole-genome sequencing data, accuracies comparable to gold-standard reference methods, although its high computational demands (currently ~30–250 CPU hours per sample) remain a significant obstacle to practical application.
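
The third step, choosing the most likely pair of alleles at a locus, can be illustrated with a minimal sketch. This is not the HLA*PRG implementation: it assumes that each read already has a likelihood under each candidate allele (the input format, the function name `best_allele_pair`, the equal 0.5/0.5 mixture of the two alleles and the toy numbers are all assumptions made for illustration), and it simply scores every unordered allele pair.

```python
import math
from itertools import combinations_with_replacement


def best_allele_pair(read_likelihoods):
    """Return the unordered allele pair with the highest log-likelihood.

    read_likelihoods maps allele name -> list of P(read_i | allele),
    with reads in the same order for every allele (hypothetical input format).
    Assumption: each read is equally likely to come from either allele.
    """
    alleles = sorted(read_likelihoods)
    n_reads = len(read_likelihoods[alleles[0]])
    best_pair, best_ll = None, -math.inf
    # Homozygous pairs (a, a) are included via combinations_with_replacement.
    for a, b in combinations_with_replacement(alleles, 2):
        ll = sum(
            math.log(0.5 * read_likelihoods[a][i] + 0.5 * read_likelihoods[b][i])
            for i in range(n_reads)
        )
        if ll > best_ll:
            best_pair, best_ll = (a, b), ll
    return best_pair, best_ll


# Toy data (invented): two reads support the first allele, two the second.
toy = {
    "A*01:01:01G": [0.9, 0.9, 0.01, 0.01],
    "A*02:01:01G": [0.01, 0.01, 0.9, 0.9],
    "A*03:01:01G": [0.2, 0.2, 0.2, 0.2],
}
pair, log_likelihood = best_allele_pair(toy)
print(pair, round(log_likelihood, 3))
```

On this toy input the heterozygous pair of the first two alleles scores highest, because it is the only pair that explains all four reads well. An exhaustive search over pairs is quadratic in the number of candidate alleles, so a practical tool would need to restrict or prune the candidate set.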
