Chances and challenges of machine learning based disease classification in genetic association studies illustrated on age-related macular degeneration

Imaging technology and machine learning algorithms for disease classification set the stage for high-throughput phenotyping and open promising new avenues for genome-wide association studies (GWAS). Despite emerging algorithms, there has been no successful application in GWAS so far. We framed machine learning based disease classification in genetic association analysis as a misclassification problem. To evaluate chances and challenges, we performed a GWAS based on automated classification of age-related macular degeneration (AMD) in UK Biobank (images from 135,500 eyes; 68,400 persons). We quantified misclassification of automatically derived AMD in internal validation data (images from 4,001 eyes; 2,013 persons) and developed a maximum likelihood approach (MLA) to account for it when estimating genetic association. In simulation studies, we demonstrate that our MLA guards against bias and artefacts. By combining a GWAS on automatically derived AMD classification with our MLA in UK Biobank data, we were able to separate true associations (ARMS2/HTRA1, CFH) from artefacts (near HERC2) and to identify eye color as a relevant source of misclassification. Using AMD as an example, we provide a proof of concept that a GWAS using machine learning derived disease classification yields relevant results and that misclassification needs to be considered in the analysis. These findings generalize to other phenotypes and also emphasize the utility of genetic data for understanding the misclassification structure of machine learning algorithms.
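
To illustrate why misclassification matters when estimating genetic association, the following is a minimal sketch of a misclassification-adjusted logistic likelihood. It is not the MLA developed in this work. It assumes a person-level binary outcome, additive genotype coding (0, 1, 2), and a known sensitivity and specificity that do not depend on genotype or other covariates. All function names, parameter values, and the use of expected (noise-free) counts are assumptions made for illustration only. The observed-outcome probability is modelled as (1 - spec) + (sens + spec - 1) * expit(b0 + b1 * g).

```python
import math


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def expected_counts(n, maf, b0, b1, sens, spec):
    """Expected observed case/control counts per genotype (hypothetical setup)."""
    freqs = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]  # Hardy-Weinberg
    data = []
    for g, f in enumerate(freqs):
        p_true = expit(b0 + b1 * g)                       # true disease risk
        p_obs = (1 - spec) + (sens + spec - 1) * p_true   # misclassified outcome
        data.append((g, n * f * p_obs, n * f * (1 - p_obs)))
    return data


def loglik(beta, data, sens, spec):
    """Log-likelihood of the observed (possibly misclassified) outcome."""
    total = 0.0
    for g, y1, y0 in data:
        p_obs = (1 - spec) + (sens + spec - 1) * expit(beta[0] + beta[1] * g)
        total += y1 * math.log(p_obs) + y0 * math.log(1 - p_obs)
    return total


def fit(data, sens=1.0, spec=1.0, max_iter=200, tol=1e-10):
    """Fisher scoring with step halving; sens = spec = 1 is the naive fit."""
    c = sens + spec - 1.0
    beta = [0.0, 0.0]
    ll = loglik(beta, data, sens, spec)
    for _ in range(max_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for g, y1, y0 in data:
            p = expit(beta[0] + beta[1] * g)
            p_obs = (1 - spec) + c * p
            d = c * p * (1 - p)                    # derivative of p_obs w.r.t. eta
            resid = y1 / p_obs - y0 / (1 - p_obs)  # score contribution
            u0 += resid * d
            u1 += resid * d * g
            w = (y1 + y0) * d * d / (p_obs * (1 - p_obs))  # expected information
            i00 += w
            i01 += w * g
            i11 += w * g * g
        det = i00 * i11 - i01 * i01
        s0 = (i11 * u0 - i01 * u1) / det           # solve the 2x2 system by hand
        s1 = (i00 * u1 - i01 * u0) / det
        step = 1.0
        while True:
            cand = [beta[0] + step * s0, beta[1] + step * s1]
            ll_new = loglik(cand, data, sens, spec)
            if ll_new >= ll or step < 1e-8:
                break
            step /= 2.0                            # halve until likelihood improves
        beta, ll = cand, ll_new
        if max(abs(step * s0), abs(step * s1)) < tol:
            break
    return beta


# Hypothetical values, chosen only for illustration.
TRUE_B0, TRUE_B1 = -2.0, 0.5
SENS, SPEC = 0.80, 0.95

data = expected_counts(100_000, 0.3, TRUE_B0, TRUE_B1, SENS, SPEC)
naive = fit(data)                 # ignores misclassification
adjusted = fit(data, SENS, SPEC)  # accounts for it in the likelihood

print("naive log odds ratio:   ", round(naive[1], 3))
print("adjusted log odds ratio:", round(adjusted[1], 3))
```

Because the sketch uses expected counts rather than sampled data, the adjusted fit recovers the log odds ratio used to generate the counts, while the naive fit is attenuated towards zero. A misclassification mechanism that depends on a covariate, such as the eye color effect reported above, would violate the constant sensitivity and specificity assumption made here and would need to be modelled explicitly.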
