Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women

Genome-wide association studies (GWAS) are used to identify statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). GWAS typically use a p-value threshold of $5 \times 10^{-8}$ to identify highly ranked SNPs. While this approach has proven useful for detecting disease-susceptibility SNPs, evidence has shown that many of these are, in fact, false positives. Consequently, there is some ambiguity about the most suitable threshold for claiming genome-wide significance. Many believe that using less stringent p-value thresholds will allow us to investigate the joint epistatic interactions between SNPs and provide better insights into phenotype expression. One example that uses this approach is multifactor dimensionality reduction (MDR), which identifies combinations of SNPs that interact to influence a particular outcome. However, computational complexity increases exponentially with the order of the combinations considered, making approaches like MDR difficult to implement. Even so, understanding epistatic interactions in complex diseases is fundamental to robust genotype-phenotype mapping. In this paper, we propose a novel framework that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis. We focus on the challenging problem of classifying preterm births, which has a strong genetic component with unexplained heritability reportedly between 20 and 40 percent.
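To make the scaling argument concrete, the following minimal sketch counts how many SNP combinations an exhaustive interaction search would need to evaluate at each interaction order. The function name `n_snp_combinations` and the panel size of 1,000 SNPs are illustrative assumptions and are not taken from the study.

```python
from math import comb


def n_snp_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP subsets of size `order` drawn from `n_snps` SNPs."""
    return comb(n_snps, order)


if __name__ == "__main__":
    n_snps = 1000  # hypothetical panel size, chosen only for illustration
    for order in range(1, 6):
        count = n_snp_combinations(n_snps, order)
        print(f"order {order}: {count:,} combinations")
```

Under this assumption, 1,000 SNPs already yield 499,500 pairs and 166,167,000 triples, which is why exhaustive higher-order searches become impractical well before genome-wide scale.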
We use a GWAS data set obtained from dbGaP that consists predominantly of urban, low-income African-American women who had normal and preterm deliveries. Epistatic interactions were extracted from the original SNP sequences using a deep learning stacked autoencoder model and used to fine-tune a classifier for discriminating between term and preterm birth observations. All models are evaluated using standard binary classifier performance metrics. The findings show that important information pertaining to SNPs and epistasis can be extracted from 4,666 raw SNPs generated using logistic regression (p-value = $5 \times 10^{-3}$) and used to fit a highly accurate classifier model. Using 50 hidden nodes, the following results were obtained: Sen = 0.9562, Spec = 0.8780, Gini = 0.9490, Logloss = 0.5901, AUC = 0.9745, and MSE = 0.2010. Using 500 hidden nodes, the results were Sen = 0.9289, Spec = 0.9591, Gini = 0.9651, Logloss = 0.3080, AUC = 0.9825, and MSE = 0.0942. The results were compared with a Support Vector Machine (SVM), a Random Forest (RF), and a Fisher's Linear Discriminant Analysis classifier, none of which improved on the deep learning approach.
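As a rough guide to how the reported metrics relate to one another, the sketch below computes sensitivity, specificity, AUC, Gini, log loss, and MSE from predicted probabilities in plain Python. It assumes the common definition Gini = 2·AUC − 1, which is consistent with the reported values but is not stated in the abstract. The function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical and are not data from the study.

```python
import math


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(y_true, y_prob, threshold=0.5):
    tp, _, _, fn = confusion_counts(y_true, y_prob, threshold)
    return tp / (tp + fn)


def specificity(y_true, y_prob, threshold=0.5):
    _, fp, tn, _ = confusion_counts(y_true, y_prob, threshold)
    return tn / (tn + fp)


def auc(y_true, y_prob):
    """Probability that a random case outscores a random control (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def gini(auc_value):
    """Assumed definition: Gini = 2 * AUC - 1."""
    return 2 * auc_value - 1


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def mse(y_true, y_prob):
    """Mean squared error between labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    # Toy labels (1 = preterm, 0 = term) and scores, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0]
    y_prob = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    a = auc(y_true, y_prob)
    print("Sen", sensitivity(y_true, y_prob), "Spec", specificity(y_true, y_prob))
    print("AUC", a, "Gini", gini(a))
    print("Logloss", log_loss(y_true, y_prob), "MSE", mse(y_true, y_prob))
```

Under the assumed definition, the reported AUC of 0.9745 gives a Gini of 0.9490, and the reported AUC of 0.9825 gives 0.9650, which matches the reported 0.9651 to within rounding.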
