Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci

Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS, namely the ability to detect genetic association through linkage disequilibrium (LD), is also its limitation. Whilst ever-increasing study sizes and improved designs have augmented the power of GWAS to detect effects, differentiating causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered both biological insight from, and clinical translation of, GWAS findings: although thousands of disease susceptibility loci have been reported, the causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in complexity, ranging from relatively simple logistic regression approaches, through ensemble models such as random forests and gradient boosting, to deep learning models, i.e., neural networks. Paired with functional validation, these methods show considerable promise for clinical translation, providing an evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review examines the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on the prioritization of complex disease-associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game, and the wide-ranging translational impact that follows.
