A support vector machine approach for detecting gene‐gene interaction

Although genetic factors play an important role in most human diseases, individual risk may be influenced by multiple genes or by genes and environmental factors acting together. To understand the underlying biological mechanisms of complex diseases, it is important to understand the complex relationships that control the process. In this paper, we consider several perspectives, namely optimization, complexity analysis, and algorithmic design, which allow us to describe a reasonable and applicable computational framework for detecting gene‐gene interactions. Accordingly, support vector machine and combinatorial optimization techniques (local search and genetic algorithm) were tailored to fit within this framework. Although the proposed approach is computationally expensive, our results indicate that it is a promising tool for the identification and characterization of high‐order gene‐gene and gene‐environment interactions. We have demonstrated several advantages of this method, including strong classification power, less concern about overfitting, the ability to handle unbalanced data, and more stable models. We would like to make the support vector machine and combinatorial optimization techniques more accessible to genetic epidemiologists, and to promote the use and extension of these powerful approaches. Genet. Epidemiol. 2008. © 2007 Wiley‐Liss, Inc.
