Iterative feature selection method to discover predictive variables and interactions for high-dimensional transplant genomic data

After allogeneic hematopoietic stem cell transplantation (allo-HCT), donor-derived immune cells can trigger devastating graft-versus-host disease (GVHD). The clinical effects of GVHD are well established; however, the genetic mechanisms that contribute to the condition remain unclear. Candidate gene studies and genome-wide association studies have shown promising results, but they are limited to a few functionally derived genes and to those with strong main effects. Transplant-related genomic studies examine two individuals simultaneously as a single case, which adds further analytical challenges. In this study, we propose a hybrid feature selection algorithm, an iterative Relief-based algorithm followed by a random forest (iRBA-RF), to reduce the number of SNPs in the original donor-recipient paired genotype data and to select the SNP sets most predictive of the phenotypic outcome in question. The proposed method does not assume any main effect of the SNPs; instead, it takes SNP interactions into account. We applied iRBA-RF to a cohort (n=331) of acute myeloid leukemia (AML) patients and their fully 10-of-10 (HLA-A, -B, -C, -DRB1, and -DQB1) HLA-matched healthy unrelated donors and assessed two case-control scenarios: AML patients (cases) vs healthy donors (controls), and the acute GVHD group (cases) vs the non-GVHD group (controls). The results show that iRBA-RF can efficiently reduce the SNP set to less than 0.05% of its original size. Moreover, a literature review showed that the selected SNPs appear to be functionally involved in the pathologic pathways of the phenotypic diseases in question, which may potentially explain the underlying mechanisms. The proposed method can effectively and efficiently analyze ultra-high-dimensional genomic data and could help provide new insights into the development of transplant-related complications from a genomic perspective.
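The iterative filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which Relief variant, neighbor count, or stopping rule iRBA-RF uses, and the random forest stage is omitted here. The sketch assumes basic Relief (one nearest hit and one nearest miss per instance, Hamming distance), a fixed fraction of features kept per iteration, and a toy binary data set in which the label depends only on the interaction of features 0 and 1. The names `relief_scores`, `iterative_relief`, `keep_fraction`, and `target` are hypothetical.

```python
from itertools import product


def relief_scores(X, y, features):
    """Basic Relief weights (assumed variant) for the given feature indices."""
    scores = {f: 0 for f in features}
    n = len(X)
    for i in range(n):
        # Hamming distance from instance i, restricted to the current features.
        def dist(j):
            return sum(X[i][f] != X[j][f] for f in features)

        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit, miss = min(hits, key=dist), min(misses, key=dist)
        for f in features:
            # Reward features that differ on the nearest miss,
            # penalize features that differ on the nearest hit.
            scores[f] += (X[i][f] != X[miss][f]) - (X[i][f] != X[hit][f])
    return scores


def iterative_relief(X, y, keep_fraction=0.5, target=2):
    """Re-score and discard the lowest-ranked features until `target` remain."""
    features = list(range(len(X[0])))
    while len(features) > target:
        scores = relief_scores(X, y, features)
        n_keep = max(target, int(len(features) * keep_fraction))
        ranked = sorted(features, key=lambda f: scores[f], reverse=True)
        features = ranked[:n_keep]
    return sorted(features)


# Toy data (an assumption for illustration): all 256 combinations of 8 binary
# features, with the label set to the XOR of features 0 and 1, so neither
# feature has a main effect on its own.
X = list(product([0, 1], repeat=8))
y = [row[0] ^ row[1] for row in X]

selected = iterative_relief(X, y)
print(selected)
```

On this toy data the two interacting features survive the filtering even though neither is individually associated with the label, which is the property the abstract attributes to Relief-based selection. In the proposed method, a random forest would then be fitted on the retained SNPs to rank them.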
