A genetic ensemble approach for gene-gene interaction identification

Background: It is now clear that gene-gene and gene-environment interactions are ubiquitous and fundamental mechanisms in the development of complex diseases. Although considerable effort has gone into developing statistical models and algorithmic strategies for identifying such interactions, identifying them accurately has proven very challenging.

Methods: In this paper, we propose a new approach for identifying the gene-gene and gene-environment interactions underlying complex diseases. It is a hybrid algorithm, called a genetic ensemble, that combines a genetic algorithm (GA) with an ensemble of classifiers. With this approach, the original problem of identifying single nucleotide polymorphism (SNP) interactions is converted into a data mining problem of combinatorial feature selection. By collecting the various subsets of SNPs and environmental factors generated in multiple GA runs, patterns of gene-gene and gene-environment interactions can be extracted using a simple combinatorial ranking method. This study also considers combining the identification results obtained from multiple algorithms. A novel formula based on the pairwise double fault is designed to quantify the degree of complementarity between algorithms.

Conclusions: Our simulation study demonstrates that the proposed genetic ensemble algorithm has identification power comparable to that of Multifactor Dimensionality Reduction (MDR) and slightly better than that of Polymorphism Interaction Analysis (PIA), the two most popular methods for gene-gene interaction identification. More importantly, the identification results generated by our genetic ensemble algorithm are highly complementary to those obtained by PIA and MDR. Experimental results from our simulation studies and a real-world data application also confirm the effectiveness of the proposed genetic ensemble algorithm, as well as the potential benefits of combining identification results from different algorithms.
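The two ideas named in the Methods, ranking factor combinations collected across GA runs and measuring complementarity through double faults, can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: the ranking simply counts how often each combination co-occurs in the selected subsets, and `double_fault` is the standard pairwise double-fault measure (the fraction of samples that both methods get wrong), not the novel formula proposed in the paper. The function names and the toy inputs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_combinations(selected_subsets, size=2):
    """Rank `size`-way factor combinations by how often they co-occur.

    `selected_subsets` holds one feature subset per GA run (assumed input
    format). This frequency count is an illustrative stand-in for the
    paper's combinatorial ranking method.
    """
    counts = Counter()
    for subset in selected_subsets:
        # Deduplicate and sort so each combination has one canonical form.
        for combo in combinations(sorted(set(subset)), size):
            counts[combo] += 1
    # Most frequent first; ties broken by name for a deterministic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def double_fault(pred_a, pred_b, truth):
    """Standard pairwise double-fault: share of samples both methods miss.

    Lower values mean the two methods rarely fail together, which suggests
    they are more complementary. This is not the paper's novel formula.
    """
    if not truth or not (len(pred_a) == len(pred_b) == len(truth)):
        raise ValueError("inputs must be non-empty and of equal length")
    both_wrong = sum(
        1 for a, b, t in zip(pred_a, pred_b, truth) if a != t and b != t
    )
    return both_wrong / len(truth)


if __name__ == "__main__":
    # Hypothetical subsets selected in three GA runs.
    runs = [["SNP1", "SNP2", "SNP7"], ["SNP1", "SNP2"], ["SNP2", "SNP7", "smoking"]]
    print(rank_combinations(runs)[:2])
    print(double_fault([1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]))
```

In this toy input, environmental factors such as `smoking` are treated as ordinary features alongside SNPs, which mirrors the abstract's framing of interaction identification as combinatorial feature selection.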
