Gene-gene interaction: the curse of dimensionality.

Genetic variants identified by genome-wide association studies frequently show only modest effects on disease risk, leading to the "missing heritability" problem. One avenue to account for part of this "missingness" is to evaluate gene-gene interactions (epistasis) and thereby elucidate their effect on complex diseases. This can potentially help with identifying gene functions, pathways, and drug targets. However, the exhaustive evaluation of all possible genetic interactions among millions of single nucleotide polymorphisms (SNPs) raises several issues, collectively known as the "curse of dimensionality". The dimensionality of epistatic analysis, in which the number of SNP combinations grows rapidly with the order of interaction, diminishes the usefulness of traditional parametric statistical methods. The immense popularity of multifactor dimensionality reduction (MDR), a non-parametric method proposed in 2001 that classifies multi-dimensional genotypes into a one-dimensional binary variable, led to the emergence of a fast-growing collection of MDR-based methods. Moreover, machine-learning (ML) methods such as random forests and neural networks (NNs), deep-learning (DL) approaches, and hybrid approaches have also been applied extensively in recent years to tackle the dimensionality issue associated with whole-genome gene-gene interaction studies. However, exhaustive searching in MDR-based approaches and variable selection in ML methods still pose the risk of missing relevant SNPs. Furthermore, interpretability issues are a major hindrance for DL methods. To minimize this loss of information, Python-based tools such as PySpark can potentially take advantage of distributed computing resources in the cloud to bring back smaller subsets of data for further local analysis. Parallel computing can be a powerful resource for fighting this "curse". PySpark supports all standard Python libraries and C extensions, making it convenient to write code that delivers dramatic improvements in processing speed for extraordinarily large data sets.
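
To make the two technical ideas above concrete, the following is a minimal sketch in plain Python (not PySpark, and not the implementation of any published MDR tool). It shows how fast the number of SNP combinations grows and how an MDR-style step pools two-locus genotype cells into a single binary high-risk/low-risk variable. The function names, the toy data, and the choice of the overall case:control ratio as the pooling threshold are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def count_interactions(n_snps, order):
    """Number of distinct SNP combinations of a given interaction order."""
    return comb(n_snps, order)


def mdr_label_cells(geno_a, geno_b, is_case):
    """Pool two-locus genotype cells into high risk (True) or low risk (False).

    Assumption: a cell is high risk when its case:control ratio is at least
    the overall case:control ratio of the sample.
    """
    n_cases = sum(is_case)
    n_controls = len(is_case) - n_cases
    cases, controls = Counter(), Counter()
    for a, b, c in zip(geno_a, geno_b, is_case):
        (cases if c else controls)[(a, b)] += 1
    labels = {}
    for cell in set(cases) | set(controls):
        # Cross-multiplied ratio comparison avoids division by zero.
        labels[cell] = cases[cell] * n_controls >= controls[cell] * n_cases
    return labels


def mdr_accuracy(geno_a, geno_b, is_case, labels):
    """Fraction of samples whose status matches their cell's risk label."""
    hits = sum(
        labels.get((a, b), False) == bool(c)
        for a, b, c in zip(geno_a, geno_b, is_case)
    )
    return hits / len(is_case)


if __name__ == "__main__":
    # Pairwise tests alone for one million SNPs.
    print(count_interactions(1_000_000, 2))

    # Hypothetical toy data: disease status depends only on the SNP pair jointly.
    snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
    snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
    status = [0, 1, 1, 0, 0, 1, 1, 0]
    cell_labels = mdr_label_cells(snp_a, snp_b, status)
    print(cell_labels, mdr_accuracy(snp_a, snp_b, status, cell_labels))
```

In the toy data neither SNP is associated with disease status on its own, yet the pooled two-locus variable separates cases from controls, which is the kind of signal an interaction search looks for. Because each SNP pair is scored independently of the others, a distributed framework could in principle evaluate pairs in parallel and return only the best-scoring subset for local follow-up.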
