The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases

Genetic epidemiologists have taken up the challenge of identifying genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with the methods available to assess their association with complex diseases. Statistical methods have been developed for analyzing the relation of large numbers of genetic and environmental predictors to disease or disease-related variables in genetic association studies.

In this commentary we discuss logistic regression analysis; neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN); and several non-parametric methods, which include the set association approach, the combinatorial partitioning method (CPM), the restricted partitioning method (RPM), the multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted.

Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods for association studies with large numbers of predictor variables. GPNN, on the other hand, may be a useful approach to select and model important predictors, but its ability to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and the random forests approach can handle a large number of predictors and are useful in reducing them to a subset with an important contribution to disease. The combinatorial methods give more insight into combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable.

As the non-parametric methods have different strengths and weaknesses, we conclude that for genetic association studies using the case-control design, applying a combination of several methods, including the set association approach, MDR and the random forests approach, will likely be a useful strategy to find the important genes and interaction patterns involved in complex diseases.
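
To make the idea of a combinatorial method concrete, the following is a minimal sketch of the core step commonly described for MDR in a balanced case-control sample: each two-SNP genotype combination is labeled high risk or low risk by comparing its case:control ratio with the overall ratio, and the SNP pair whose labeling classifies subjects best is retained. This is an illustration under stated assumptions, not the published MDR software: it omits the cross-validation and permutation testing that MDR analyses normally use, the function name `mdr_best_pair` and the toy data are hypothetical, and genotypes are coded 0/1 for brevity.

```python
from collections import Counter
from itertools import combinations, product


def mdr_best_pair(genotypes, status):
    """Simplified two-locus MDR-style search (illustrative, no cross-validation).

    genotypes: list of tuples, one genotype code per SNP for each subject.
    status:    list of 1 (case) or 0 (control), same length as genotypes.
    Assumes at least one case and one control are present.
    Returns (accuracy, (snp_i, snp_j), high_risk_cells) for the best pair.
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    n_snps = len(genotypes[0])
    best = None
    for i, j in combinations(range(n_snps), 2):
        cases, controls = Counter(), Counter()
        for g, s in zip(genotypes, status):
            cell = (g[i], g[j])  # one multilocus genotype combination
            if s == 1:
                cases[cell] += 1
            else:
                controls[cell] += 1
        # A cell is high risk if its case:control ratio exceeds the threshold.
        high_risk = {c for c in set(cases) | set(controls)
                     if cases[c] > threshold * controls[c]}
        # Classify high-risk cells as cases and all other cells as controls.
        correct = (sum(cases[c] for c in high_risk)
                   + sum(n for c, n in controls.items() if c not in high_risk))
        accuracy = correct / len(status)
        if best is None or accuracy > best[0]:
            best = (accuracy, (i, j), high_risk)
    return best


# Hypothetical toy data: SNP 0 and SNP 1 interact without main effects
# (a subject is a case only when the two genotypes differ); SNP 2 is noise.
genotypes, status = [], []
for g in product((0, 1), repeat=3):
    for _ in range(5):
        genotypes.append(g)
        status.append(1 if g[0] != g[1] else 0)

accuracy, pair, high_risk = mdr_best_pair(genotypes, status)
print(pair, accuracy, sorted(high_risk))
```

In this toy example neither SNP 0 nor SNP 1 shows a marginal effect, yet the pair is recovered because the search evaluates genotype combinations jointly. The exhaustive loop over pairs also shows why such methods become computationally demanding as the number of predictors grows.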

[1]  R. Bellman,et al.  V. Adaptive Control Processes , 1964 .

[2]  Toshio Odanaka,et al.  ADAPTIVE CONTROL PROCESSES , 1990 .

[3]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[4]  J. Concato,et al.  A simulation study of the number of events per variable in logistic regression analysis. , 1996, Journal of clinical epidemiology.

[5]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[6]  D Hurnik,et al.  An overview of techniques for dealing with large numbers of independent variables in epidemiologic studies. , 1997, Preventive veterinary medicine.

[7]  J. Ott,et al.  Neural network analysis of complex traits , 1997, Genetic epidemiology.

[8]  M. Province,et al.  19 Classification methods for confronting heterogeneity , 2001 .

[9]  C. Sing,et al.  A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. , 2001, Genome research.

[10]  Y. Benjamini,et al.  Controlling the false discovery rate in behavior genetics research , 2001, Behavioural Brain Research.

[11]  M. Province,et al.  Classification methods for confronting heterogeneity. , 2001, Advances in genetics.

[12]  J. Ott,et al.  Statistical multilocus methods for disequilibrium analysis in complex traits , 2001, Human mutation.

[13]  N. Schork,et al.  The future of genetic case-control studies. , 2001, Advances in genetics.

[14]  J. H. Moore,et al.  Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. , 2001, American journal of human genetics.

[15]  J. Ott,et al.  Trimming, weighting, and grouping SNPs in human case-control association studies. , 2001, Genome research.

[16]  J. Ott,et al.  Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease , 2002, The Pharmacogenomics Journal.

[17]  T. Reich,et al.  A perspective on epistasis: limits of models displaying no main effect. , 2002, American journal of human genetics.

[18]  Jürg Ott,et al.  Set association analysis of SNP case-control and microarray data , 2002, RECOMB '02.

[19]  J H Moore,et al.  A comparison of combinatorial partitioning and linear regression for the detection of epistatic effects of the ACE I/D and PAI‐1 4G/5G polymorphisms on plasma PAI‐1 levels , 2002, Clinical genetics.

[20]  J. H. Moore,et al.  The relationship between plasma t‐PA and PAI‐1 levels is dependent on epistatic effects of the ACE I/D and PAI‐1 4G/5G polymorphisms , 2002, Clinical genetics.

[21]  Scott M. Williams,et al.  New strategies for identifying gene-gene interactions in hypertension , 2002, Annals of medicine.

[22]  Jason H. Moore,et al.  An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene Interactions on risk of myocardial infarction: The importance of model validation , 2004, BMC Bioinformatics.

[23]  J. Ott,et al.  Mathematical multi-locus approaches to localizing complex human trait genes , 2003, Nature Reviews Genetics.

[24]  J. H. Moore,et al.  Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus , 2004, Diabetologia.

[25]  D Curtis,et al.  Assessing Optimal Neural Network Architecture for Identifying Disease‐associated Multi‐marker Genotypes using a Permutation Test, and Application to Calpain 10 Polymorphisms Associated with Diabetes , 2003, Annals of human genetics.

[26]  Jurg Ott,et al.  Sum statistics for the joint detection of multiple disease loci in case‐control association studies with SNP markers , 2003, Genetic epidemiology.

[27]  Jason H. Moore,et al.  Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions , 2003, Bioinform..

[28]  Bill C White,et al.  Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases , 2003, BMC Bioinformatics.

[29]  Jason H. Moore,et al.  Power of multifactor dimensionality reduction for detecting gene‐gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity , 2003, Genetic epidemiology.

[30]  Hiroyuki Honda,et al.  Artificial neural network approach for selection of susceptible single nucleotide polymorphisms and construction of prediction model on childhood allergic asthma , 2004, BMC Bioinformatics.

[31]  K. Lunetta,et al.  Screening large-scale association study data: exploiting interactions using random forests , 2004, BMC Genetics.

[32]  Jonathan L Haines,et al.  Genetics, statistics and human disease: analytical retooling for complexity. , 2004, Trends in genetics : TIG.

[33]  Jason H Moore,et al.  Computational analysis of gene-gene interactions using multifactor dimensionality reduction , 2004, Expert review of molecular diagnostics.

[34]  Lang Li,et al.  Selecting pre‐screening items for early intervention trials of dementia—a case study , 2004, Statistics in medicine.

[35]  D. D. de Quervain,et al.  Glucocorticoid-related genetic susceptibility for Alzheimer's disease. , 2003, Human molecular genetics.

[36]  William Shannon,et al.  Detecting epistatic interactions contributing to quantitative traits , 2004, Genetic epidemiology.

[37]  H. K. Lee,et al.  Erratum to: Common genetic polymorphisms in the promoter of resistin gene are major determinants of plasma resistin concentrations in humans , 2004, Diabetologia.

[38]  Marylyn D. Ritchie,et al.  Multilocus Analysis of Hypertension: A Hierarchical Approach , 2004, Human Heredity.

[39]  Jeroen Smits,et al.  Testing goodness‐of‐fit of the logistic regression model in case–control studies using sample reweighting , 2005, Statistics in medicine.

[40]  K. Lunetta,et al.  Identifying SNPs predictive of phenotype using random forests , 2005, Genetic epidemiology.

[41]  David M. Reif,et al.  Combinatorial Pharmacogenetics , 2005, Nature Reviews Drug Discovery.

[42]  Marylyn D. Ritchie,et al.  GPNN: Power studies and applications of a neural network method for detecting gene-gene interactions in studies of human disease , 2006, BMC Bioinformatics.

[43]  Scott M. Williams,et al.  Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. , 2005, BioEssays : news and reviews in molecular, cellular and developmental biology.

[44]  Jason H. Moore,et al.  The Interaction of Four Genes in the Inflammation Pathway Significantly Predicts Prostate Cancer Risk , 2005, Cancer Epidemiology Biomarkers & Prevention.

[45]  Todd Holden,et al.  A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. , 2006, Journal of theoretical biology.