Sensible initialization using expert knowledge for genome-wide analysis of epistasis using genetic programming

Biomedical researchers can now measure large numbers of DNA sequence variations across the human genome. Measuring hundreds of thousands of variations is routine, but single variations that consistently predict an individual's risk of common human disease have proven elusive. Rather than being determined by single variants, the risk of common human diseases seems more likely to be best modeled by interactions between biological components. The challenge for evolutionary computing is to explore interactions in these large datasets effectively and to identify combinations of variations that are robust predictors of common human diseases such as bladder cancer. One promising approach is genetic programming (GP), which uses Darwinian-inspired evolution to evolve programs that find and model the attribute interactions predicting an individual's risk of common human diseases. The goal of this study is to develop and evaluate two initializers for this domain: a probabilistic initializer that uses expert knowledge to select attributes, and an enumerative initializer that maximizes attribute diversity in the generated population. We compare these initializers to a random initializer that displays no preference for attributes. We show that the expert-knowledge-aware probabilistic initializer significantly outperforms both the random initializer and the enumerative initializer. We discuss the implications of these results for the design of GP strategies that can detect and characterize predictors of common human diseases.
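The three initialization strategies named in the abstract can be contrasted with a minimal sketch. This is an illustration under stated assumptions, not the study's implementation: each individual is reduced to a flat list of attribute indices (standing in for the terminals of a GP tree), the expert knowledge is assumed to arrive as one non-negative score per attribute, and the function names `random_init`, `probabilistic_init`, and `enumerative_init` are hypothetical.

```python
import random
from itertools import cycle

# Assumption: an "individual" is just a list of k attribute indices,
# a stand-in for the terminal nodes of a GP program tree.

def random_init(n_attrs, pop_size, k, rng):
    """Baseline: every attribute is equally likely to be chosen."""
    return [[rng.randrange(n_attrs) for _ in range(k)]
            for _ in range(pop_size)]

def probabilistic_init(scores, pop_size, k, rng):
    """Draw attributes with probability proportional to their
    (assumed) expert-knowledge score."""
    attrs = range(len(scores))
    return [rng.choices(attrs, weights=scores, k=k)
            for _ in range(pop_size)]

def enumerative_init(n_attrs, pop_size, k):
    """Walk through the attributes in order, wrapping around, so that
    every attribute appears about equally often in the population."""
    stream = cycle(range(n_attrs))
    return [[next(stream) for _ in range(k)]
            for _ in range(pop_size)]

if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.1, 0.1, 5.0, 0.1, 3.0, 0.1]  # made-up illustrative scores
    print(random_init(len(scores), 3, 2, rng))
    print(probabilistic_init(scores, 3, 2, rng))
    print(enumerative_init(len(scores), 3, 2))
```

In this sketch the probabilistic initializer concentrates the starting population on highly scored attributes, while the enumerative initializer spreads it evenly over all of them; how the scores are obtained and how the attributes are assembled into full programs is left open here.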
