Genomic mining for complex disease traits with “random chemistry”

Our rapidly growing knowledge of genetic variation in the human genome offers great potential for understanding the genetic etiology of disease. This, in turn, could revolutionize the detection, treatment, and in some cases prevention of disease. While genes for most of the rare monogenic diseases have already been discovered, most common diseases are complex traits, resulting from multiple gene–gene and gene–environment interactions. Detecting epistatic genetic interactions that predispose to disease is an important, but computationally daunting, task currently facing bioinformaticians. Here, we propose a new evolutionary approach that attempts to hill-climb from large sets of candidate epistatic genetic features to smaller sets, inspired by Kauffman’s “random chemistry” approach to detecting small auto-catalytic sets of molecules from within large sets. Although the algorithm is conceptually straightforward, its success hinges on a fitness function able to discriminate large sets that contain subsets of interacting genetic features from those that do not. We employ an approximate and noisy fitness function based on the ReliefF data mining algorithm. We establish proof of concept using synthetic data sets in which individual features have no marginal effects. We show that the resulting algorithm can successfully detect epistatic pairs from up to 1,000 candidate single nucleotide polymorphisms in time that is linear in the size of the initial set, although the success rate degrades as heritability declines. Ongoing research seeks a more accurate fitness approximator for large sets, along with other algorithmic improvements, to extend the approach to larger data sets and lower heritabilities.
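
The set-shrinking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`random_chemistry`, `make_toy_fitness`), the halving schedule (`shrink=0.5`), the acceptance rule, and the retry cap (`max_tries`) are all assumptions made for illustration. In particular, the toy fitness function is a stand-in oracle that knows the planted pair; it is not ReliefF, which would have to estimate the signal from genotype data.

```python
import random


def make_toy_fitness(pair, noise=0.0, rng=None):
    """Hypothetical stand-in for a set-level fitness function (NOT ReliefF).

    Scores 1.0 when the subset contains both members of the planted pair,
    0.0 otherwise, plus optional Gaussian noise.
    """
    rng = rng or random.Random()
    pair = set(pair)

    def fitness(subset):
        signal = 1.0 if pair <= set(subset) else 0.0
        if noise:
            signal += rng.gauss(0.0, noise)
        return signal

    return fitness


def random_chemistry(features, fitness, final_size=2, shrink=0.5,
                     max_tries=1000, rng=None):
    """Shrink a large candidate set by repeated random subsetting.

    At each stage, random subsets of the current set are drawn until one
    scores at least as well as the current set; that subset becomes the
    new current set. Returns None if a stage exhausts max_tries.
    """
    rng = rng or random.Random()
    current = list(features)
    while len(current) > final_size:
        next_size = max(final_size, int(len(current) * shrink))
        baseline = fitness(current)
        for _ in range(max_tries):
            candidate = rng.sample(current, next_size)
            if fitness(candidate) >= baseline:
                current = candidate  # accept the smaller set and continue
                break
        else:
            return None  # no acceptable subset found at this stage
    return set(current)


if __name__ == "__main__":
    rng = random.Random(0)
    fit = make_toy_fitness({17, 42})  # noise-free oracle for the demo
    print(random_chemistry(range(100), fit, rng=rng))
```

With a noise-free oracle the search can only accept subsets that still contain the planted pair, so a run that succeeds ends on the pair itself. With `noise > 0` it can accept a subset that has lost the pair, which is one way to see why the quality of the fitness approximator matters so much for large sets.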
