Learning epistatic polygenic phenotypes with Boolean interactions

Detecting epistatic drivers of human phenotypes remains a challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving single pairs of genetic variants. For higher-order interactions and large-scale genome-wide data, this strategy is computationally intractable. Moreover, the multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We then derive significance tests for these interactions by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that evaluate the stability of improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline using the red hair phenotype from the UK Biobank, where several genes are known to demonstrate epistatic interactions. epiTree recovers both previously reported and novel interactions, which represent forms of non-linearity not captured by logistic regression models. Additionally, epiTree suggests interactions between genes such as PKHD1 and XPOTP1, which are unlinked to MC1R, as novel candidate interactions associated with the red hair phenotype. We also find that individual Boolean or tree-based epistasis models generally provide higher prediction accuracy than classical logistic regression.
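The last pipeline step described above, comparing a no-epistasis model with an epistasis model on held-out data and bootstrapping the test set, can be illustrated with a toy sketch. This is not the epiTree implementation or the PCS p-value as defined in the paper. The simulated two-variant data, the additive logistic null model, the saturated Boolean cell-mean alternative, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np

def simulate(n, rng):
    """Hypothetical data: two binary variants with an AND-type interaction."""
    x = rng.integers(0, 2, size=(n, 2))
    prob = 0.05 + 0.6 * (x[:, 0] & x[:, 1])  # risk rises only when both are 1
    y = rng.binomial(1, prob)
    return x, y

def fit_additive_logistic(x, y, n_iter=25):
    """Null (no epistasis): logit-additive model fit by Newton's method."""
    design = np.column_stack([np.ones(len(x)), x]).astype(float)
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ coef))
        grad = design.T @ (y - p)
        hess = (design * (p * (1 - p))[:, None]).T @ design
        coef += np.linalg.solve(hess + 1e-8 * np.eye(len(coef)), grad)

    def predict(z):
        d = np.column_stack([np.ones(len(z)), z]).astype(float)
        return 1.0 / (1.0 + np.exp(-d @ coef))
    return predict

def fit_boolean_cells(x, y):
    """Alternative (epistasis): one response rate per genotype combination."""
    cell = 2 * x[:, 0] + x[:, 1]
    rates = np.array([y[cell == c].mean() if np.any(cell == c) else y.mean()
                      for c in range(4)])
    rates = np.clip(rates, 1e-6, 1 - 1e-6)  # keep the log-loss finite
    return lambda z: rates[2 * z[:, 0] + z[:, 1]]

def per_sample_log_loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bootstrap_p_value(loss_null, loss_alt, n_boot, rng):
    """Fraction of bootstrap test sets on which the alternative fails to improve."""
    gain = loss_null - loss_alt  # positive means the alternative predicts better
    idx = rng.integers(0, len(gain), size=(n_boot, len(gain)))
    return float(np.mean(gain[idx].mean(axis=1) <= 0))

rng = np.random.default_rng(0)
x_train, y_train = simulate(8000, rng)
x_test, y_test = simulate(8000, rng)

null_model = fit_additive_logistic(x_train, y_train)
alt_model = fit_boolean_cells(x_train, y_train)

loss_null = per_sample_log_loss(y_test, null_model(x_test))
loss_alt = per_sample_log_loss(y_test, alt_model(x_test))
p_value = bootstrap_p_value(loss_null, loss_alt, n_boot=300, rng=rng)

print(f"mean log-loss, additive null:       {loss_null.mean():.4f}")
print(f"mean log-loss, Boolean alternative: {loss_alt.mean():.4f}")
print(f"bootstrap stability p-value:        {p_value:.4f}")
```

Both models are fit on the training split only, and the bootstrap resamples only the test split, so the reported value reflects how stably the interaction model improves held-out prediction rather than in-sample fit.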
