A machine-learning method for biobank-scale genetic prediction of blood group antigens

Blood transfusion is a life-saving medical procedure performed routinely worldwide. A key element of successful transfusion is compatibility between patient and donor red blood cell (RBC) antigens. Precise antigen matching reduces the risk of immunization and other adverse transfusion outcomes. RBC antigens are encoded by specific genes, which makes it possible to develop computational methods for determining antigens from genomic data. We describe here a classification method for determining RBC antigens from genotyping array data. Random forest models for 39 RBC antigens in 14 blood group systems and for human platelet antigen (HPA)-1 were trained and tested using genotype data together with RBC antigen and HPA-1 typing data available for 1,192 blood donors in the Finnish Blood Service Biobank. The algorithm and models were further evaluated in a validation cohort of 111,667 Danish blood donors. In the Finnish test data set, the median (interquartile range [IQR]) balanced accuracy for 39 models was 99.9% (98.9-100%). We were able to replicate 34 of the 39 Finnish models in the Danish cohort, where the median (IQR) balanced accuracy of the classifications was 97.1% (90.1-99.4%). When models trained with the Danish cohort were applied instead, the median (IQR) balanced accuracy for the 40 Danish models in the Danish test data set was 99.3% (95.1-99.8%). The RBC antigen and HPA-1 prediction models demonstrated high overall accuracies, suitable for probabilistic determination of blood groups and HPA-1 at biobank scale. Furthermore, using a population-specific training cohort increased the accuracy of the models. This stand-alone and freely available method is applicable to research and to screening for antigen-negative blood donors.
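
The headline metric above, balanced accuracy summarized as a median with IQR across models, can be illustrated with a minimal sketch. It assumes each antigen is a binary label (antigen-positive versus antigen-negative) and takes balanced accuracy as the mean of sensitivity and specificity. The function names, the label strings, and the choice of the inclusive quantile method are illustrative assumptions, not the authors' implementation.

```python
from statistics import quantiles


def balanced_accuracy(y_true, y_pred, positive="pos"):
    """Mean of sensitivity and specificity for one binary antigen model.

    Labels are hypothetical: `positive` marks antigen-positive donors,
    any other value is treated as antigen-negative.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def median_iqr(values):
    """Return (median, first quartile, third quartile) of per-model scores.

    Uses the 'inclusive' quantile method; other conventions give slightly
    different quartiles on small samples.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2, q1, q3


if __name__ == "__main__":
    # Toy typing results for a single antigen (made-up values).
    truth = ["pos", "pos", "pos", "pos", "neg", "neg"]
    predicted = ["pos", "pos", "pos", "neg", "neg", "neg"]
    print(balanced_accuracy(truth, predicted))

    # Toy per-model scores summarized the way the abstract reports them.
    print(median_iqr([0.90, 0.95, 1.00]))
```

Balanced accuracy is a natural choice here because many antigens are rare or nearly universal in a given population, so plain accuracy would be dominated by the majority class.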
