Optimized permutation testing for information theoretic measures of multi-gene interactions

Background: Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form, in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large.

Results: In this paper, we develop an approach for permutation testing in multi-locus GWAS, focusing specifically on SNP–SNP–phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on information theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, the cost of this approach is insensitive to the number of samples, making it suitable for datasets with large numbers of samples.

Conclusions: The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies, and its cost does not grow with the large sample sizes of these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts .
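As an illustration of the idea of transforming count tables directly, the following is a minimal sketch and not the code from the repository above. It assumes a binary case/control phenotype and a 9 x 2 table of joint SNP-pair genotype counts by phenotype. Under those assumptions, shuffling the phenotype labels is equivalent to drawing the number of cases in each genotype cell from a multivariate hypergeometric distribution, so permuted tables can be sampled without revisiting the individual samples. The function names, the use of mutual information as the test statistic, and the table layout are hypothetical choices made for this example.

```python
import numpy as np


def mutual_information(table):
    """Mutual information (bits) between rows (joint genotype) and columns (phenotype)."""
    p = table / table.sum()
    p_row = p.sum(axis=1, keepdims=True)
    p_col = p.sum(axis=0, keepdims=True)
    nz = p > 0  # empty cells contribute nothing to the sum
    return float((p[nz] * np.log2(p[nz] / (p_row @ p_col)[nz])).sum())


def permuted_tables(table, n_perm, rng):
    """Sample count tables equivalent to permuting binary phenotype labels.

    Assumption of this sketch: column 0 holds controls and column 1 holds cases.
    Row totals (genotype counts) and column totals (phenotype counts) are kept
    fixed, exactly as they would be under a label permutation.
    """
    cell_totals = table.sum(axis=1)
    n_cases = int(table[:, 1].sum())
    cases = rng.multivariate_hypergeometric(cell_totals, n_cases, size=n_perm)
    controls = cell_totals - cases
    return np.stack([controls, cases], axis=-1)  # shape (n_perm, n_cells, 2)


def permutation_p_value(table, n_perm=1000, seed=0):
    """Permutation p-value for the mutual information of one count table."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(table)
    null = np.array(
        [mutual_information(t) for t in permuted_tables(table, n_perm, rng)]
    )
    # Add-one correction so the estimated p-value is never exactly zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


# Hypothetical 9 x 2 table: rows are the 3 x 3 joint genotypes of two SNPs.
example = np.array(
    [[60, 10], [40, 20], [30, 30], [25, 25], [20, 30],
     [15, 35], [10, 40], [5, 45], [5, 55]]
)
p_value = permutation_p_value(example, n_perm=500)
```

In this sketch the cost of each permutation depends on the number of table cells rather than on the number of individuals, which is the property the abstract describes as insensitivity to sample size.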
