SYMPHONY, an information-theoretic method for gene–gene and gene–environment interaction analysis of disease syndromes

We develop an information-theoretic method for gene–gene interaction (GGI) and gene–environment interaction (GEI) analysis of syndromes, where a syndrome is defined as a phenotype vector comprising multiple quantitative traits (QTs). The K-way interaction information (KWII), an information-theoretic metric, was derived for multivariate normally distributed phenotype vectors. The utility of the method was challenged with three simulated data sets, the Genetic Analysis Workshop 15 (GAW15) rheumatoid arthritis data set, a high-density lipoprotein (HDL) and atherosclerosis data set from a mouse QT locus study, and the 1000 Genomes data. The dependence of the KWII on effect size, minor allele frequency, linkage disequilibrium, and population stratification/admixture, as well as the power and computational time requirements of the method, were systematically assessed in simulation studies. In these studies, phenotype vectors containing two and three constituent multivariate normally distributed QTs were used, and the KWII was found to be effective at detecting GEI associated with the phenotype. KWII values were higher for variables and variable combinations associated with the syndrome phenotype than for uninformative variables not associated with it. The KWII values for the phenotype-associated combinations increased monotonically with increasing effect size. The KWII also exhibited utility in simulations with non-linear dependence between the constituent QTs. In the HDL and atherosclerosis data set, simultaneous analysis of both phenotypes identified interactions that were not detected when the individual traits were analyzed separately. The information-theoretic approach may be useful for non-parametric analysis of GGI and GEI of complex syndromes.
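As a rough illustration of the quantity involved, the sketch below estimates the simplest case, a two-variable KWII, which reduces to the mutual information between one genotype and a phenotype vector. It is not the SYMPHONY implementation: the function names, the plug-in covariance estimator, and the treatment of the marginal phenotype distribution as a single Gaussian are all assumptions made for this example. The only standard ingredients are the differential entropy of a d-dimensional normal distribution, 0.5 * log((2*pi*e)^d * det(Sigma)), and the identity I(G; P) = h(P) - sum_g p(g) h(P | G = g).

```python
import numpy as np

def gaussian_entropy(x):
    """Plug-in differential entropy (nats) of samples x, assuming normality.

    x has shape (n_samples, d). Uses 0.5 * log((2*pi*e)^d * det(Sigma)).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = x.shape[1]
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def kwii_genotype_phenotype(genotype, phenotype):
    """Two-variable KWII (mutual information) between a discrete genotype
    and a continuous phenotype vector.

    Assumption of this sketch: both the marginal and the per-genotype
    phenotype distributions are approximated as multivariate normal.
    """
    genotype = np.asarray(genotype)
    phenotype = np.asarray(phenotype, dtype=float)
    h_marginal = gaussian_entropy(phenotype)
    h_conditional = 0.0
    for g in np.unique(genotype):
        mask = genotype == g
        # Weight each genotype class by its observed frequency.
        h_conditional += mask.mean() * gaussian_entropy(phenotype[mask])
    return h_marginal - h_conditional

# Hypothetical simulated data: one SNP shifts the mean of both traits,
# a second SNP is unrelated to the phenotype vector.
rng = np.random.default_rng(0)
n = 2000
snp_causal = rng.binomial(2, 0.3, size=n)
snp_null = rng.binomial(2, 0.3, size=n)
traits = rng.standard_normal((n, 2)) + 1.0 * snp_causal[:, None]

print("associated SNP:", kwii_genotype_phenotype(snp_causal, traits))
print("unassociated SNP:", kwii_genotype_phenotype(snp_null, traits))
```

In this toy setting the associated SNP should score clearly above the unassociated one, which mirrors the qualitative behavior described in the abstract. Higher-order KWII values combine the entropies of all variable subsets with alternating signs and are not covered by this sketch.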
