ATHENA: the analysis tool for heritable and environmental network associations

MOTIVATION: Advancements in high-throughput technology have allowed researchers to examine the genetic etiology of complex human traits in a robust fashion. Although genome-wide association studies have identified many novel variants associated with hundreds of traits, a large proportion of the estimated trait heritability remains unexplained. One hypothesis is that the commonly used statistical techniques and study designs are not robust to the complex etiology that may underlie these human traits. This etiology could include non-linear gene × gene or gene × environment interactions. Additionally, other levels of biological regulation may play a large role in trait variability.

RESULTS: To address the need for computational tools that can explore enormous datasets to detect complex susceptibility models, we have developed a software package called the Analysis Tool for Heritable and Environmental Network Associations (ATHENA). ATHENA combines various variable filtering methods with machine learning techniques to analyze high-throughput categorical (e.g. single nucleotide polymorphisms) and quantitative (e.g. gene expression levels) predictor variables, generating multivariable models that predict either a categorical (e.g. disease status) or a quantitative (e.g. cholesterol level) outcome. The goal of this article is to demonstrate the utility of ATHENA in identifying complex prediction models, using simulated and biological datasets that contain both single nucleotide polymorphisms and gene expression variables. Importantly, this method is flexible and can be expanded to include other types of high-throughput data (e.g. RNA-seq data and biomarker measurements).

AVAILABILITY: ATHENA is freely available for download. The software, user manual and tutorial can be downloaded from http://ritchielab.psu.edu/ritchielab/software.
