GP-Pi: Using Genetic Programming with Penalization and Initialization on Genome-Wide Association Study

The advancement of chip-based technology has enabled the measurement of millions of DNA sequence variations across the human genome. Experiments have shown that high-order interactions among single nucleotide polymorphisms (SNPs), rather than individual SNPs, are responsible for complex diseases such as cancer. The challenge of genome-wide association studies (GWASs) is to sift through high-dimensional datasets to find the particular combinations of SNPs that are predictive of these diseases. Genetic Programming (GP) has been widely applied in GWASs, where it serves two purposes: attribute selection and/or discriminative modeling. One advantage of discriminative modeling over attribute selection lies in interpretability. However, existing discriminative modeling algorithms do not scale well as the number of SNPs increases. To address this, we developed GP-Pi, which adds two components: a penalizing term in the fitness function that penalizes trees sharing common SNPs, and an initializer that uses expert knowledge to seed the population with good attributes. Experimental results on simulated data suggest that GP-Pi outperforms GPAS with statistical significance. GP-Pi was further evaluated on a real GWAS dataset on rheumatoid arthritis, obtained from the North American Rheumatoid Arthritis Consortium. Our results, which include potential new discoveries, are consistent with the literature.
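The abstract does not give the exact form of the penalty or of the initializer, so the following Python sketch is only one plausible reading of the two ideas, not the authors' implementation. It assumes each GP tree can be summarized by the set of SNPs it uses, that a base fitness such as classification accuracy is already available, and that expert knowledge arrives as a non-negative score per SNP. The names `seed_population`, `penalized_fitness`, and the weight `lam` are hypothetical.

```python
import random
from collections import Counter


def seed_population(scores, pop_size, snps_per_tree, rng):
    """Seed each individual with SNPs drawn in proportion to an expert score.

    `scores` maps SNP id -> non-negative score (hypothetical input; the
    abstract only says that expert knowledge is used for seeding).
    """
    snps = list(scores)
    weights = [scores[s] for s in snps]
    if snps_per_tree > sum(1 for w in weights if w > 0):
        raise ValueError("not enough SNPs with a positive score")
    population = []
    for _ in range(pop_size):
        chosen = set()
        # Redraw until the individual holds the requested number of distinct SNPs.
        while len(chosen) < snps_per_tree:
            chosen.add(rng.choices(snps, weights=weights, k=1)[0])
        population.append(frozenset(chosen))
    return population


def penalized_fitness(accuracy, tree_snps, snp_counts, pop_size, lam=0.5):
    """Subtract a penalty that grows with how many other trees share this tree's SNPs.

    `snp_counts[s]` is the number of trees in the population that use SNP `s`
    (including this tree). The linear form and `lam` are assumptions.
    """
    if pop_size <= 1 or not tree_snps:
        return accuracy
    # Average, over this tree's SNPs, of the fraction of OTHER trees using each SNP.
    overlap = sum((snp_counts[s] - 1) / (pop_size - 1) for s in tree_snps) / len(tree_snps)
    return accuracy - lam * overlap


if __name__ == "__main__":
    rng = random.Random(0)
    expert_scores = {"rs1": 5.0, "rs2": 1.0, "rs3": 0.0, "rs4": 1.0}
    population = seed_population(expert_scores, pop_size=4, snps_per_tree=2, rng=rng)
    counts = Counter(s for tree in population for s in tree)
    for tree in population:
        print(sorted(tree), penalized_fitness(0.8, tree, counts, len(population)))
```

In this sketch, a tree built only from SNPs that no other tree uses keeps its full base fitness, while trees that crowd onto the same SNPs are pushed down, which is one way to encourage the population to explore different SNP combinations.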
