Analysis of multiple SNPs in genetic association studies: comparison of three multi‐locus methods to prioritize and select SNPs

Nonparametric approaches have been developed that can analyze large numbers of single nucleotide polymorphisms (SNPs) in modest sample sizes. These approaches differ in their selection features and may not give similar results when applied to the same dataset. We therefore compared three approaches (set association, random forests and multifactor dimensionality reduction [MDR]) for selecting, from a total of 93 candidate SNPs, a subset of SNPs that are important in determining high‐density lipoprotein (HDL)‐cholesterol levels. The study population was a random sample from a Dutch monitoring project for cardiovascular disease risk factors and was dichotomized into cases (low HDL‐cholesterol, n = 533) and non‐cases (high HDL‐cholesterol, n = 545) based on gender‐specific median values for HDL‐cholesterol. All three approaches prioritized the same three SNPs as important (CETP Taq1B, CETP−629 C/A and LPL Ser447X). Random forests additionally prioritized two SNPs with weaker main effects (APOC3 3175 G/C and CCR2 Val62Ile), whereas MDR selected MTHFR 677 C/T in combination with CETP Taq1B as its best model. The selected models were statistically significant for the set association approach (p = .0019), random forests (p < .01) and MDR (p < .02). In conclusion, applying a combination of multi‐locus methods is a useful way in genetic association studies to select a well‐defined set of important SNPs for further statistical and epidemiological interpretation, providing increased confidence and more information than the application of a single method. Genet. Epidemiol. 2007. © 2007 Wiley‐Liss, Inc.
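The dichotomization step described above (cases versus non-cases split at a gender-specific median) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name, the example values, and the tie rule (a value exactly equal to its group median is treated as a non-case) are assumptions made for illustration only, since the abstract does not say how ties were handled.

```python
from statistics import median

def dichotomize_by_group_median(values, groups):
    """Return 1 (case: below own group's median) or 0 (non-case) per value.

    Assumption: ties with the median count as non-cases; the abstract
    does not specify this.
    """
    # Compute one median per group (e.g. per gender).
    medians = {
        g: median([v for v, vg in zip(values, groups) if vg == g])
        for g in set(groups)
    }
    # Compare each value with the median of its own group.
    return [1 if v < medians[g] else 0 for v, g in zip(values, groups)]

# Hypothetical HDL-cholesterol values (mmol/L) and gender labels.
hdl = [1.0, 1.2, 1.4, 1.6, 0.9, 1.1, 1.3, 1.5]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
labels = dichotomize_by_group_median(hdl, sex)
print(labels)
```

Splitting within each gender rather than at a pooled median keeps the case and non-case groups roughly balanced by gender, which matters because HDL-cholesterol levels differ systematically between men and women.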
