SNP Subset Selection for Genetic Association Studies

Association studies for disease susceptibility genes rely on the high density of SNPs within candidate genes. However, the linkage disequilibrium between SNPs implies that not all SNPs identified in the candidate region need be genotyped. Here we develop several approaches to SNP subset selection, which can substantially reduce the number of SNPs to be genotyped in an association study. We apply clustering algorithms to pairwise linkage disequilibrium measures, with SNP subsets determined for different cut-off values of Δ using nearest and furthest neighbour clusters. Alternatively, SNP subsets may be determined by the proportion of haplotypes they identify. We also show how power calculations, based on the average power to identify a SNP as the disease susceptibility mutation using haplotype-based or logistic-regression-based statistical analyses, can be used to choose SNP subsets. All these methods rank subsets of a given size, but they do not provide criteria for choosing the overall SNP subset size. We develop such criteria by incorporating power calculations into a decision analysis, where the choice of SNP subset size depends on the genotyping costs and the perceived benefits of identifying association. These methods are illustrated using eleven SNPs in the MMP2 gene.
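The clustering approach can be illustrated with a minimal sketch. It assumes a symmetric matrix of pairwise Δ values is already available and merges SNP clusters for as long as the between-cluster Δ stays at or above the cut-off. The function names, the toy 4-SNP matrix, and the rule for picking one representative SNP per cluster are hypothetical choices made for illustration, not the authors' implementation or the MMP2 data.

```python
def cluster_snps(ld, cutoff, linkage="furthest"):
    """Merge SNP clusters while the between-cluster LD is >= cutoff.

    ld      : symmetric matrix (list of lists) of pairwise LD values
    linkage : "nearest"  -> similarity is the most correlated pair
              "furthest" -> similarity is the least correlated pair
    """
    combine = max if linkage == "nearest" else min
    clusters = [[i] for i in range(len(ld))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the highest between-cluster LD.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = combine(ld[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        if best[0] < cutoff:
            break  # no remaining pair is similar enough to merge
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def pick_representatives(ld, clusters):
    """Hypothetical rule: keep the SNP with the highest total LD to its cluster mates."""
    return [
        max(members, key=lambda i: sum(ld[i][j] for j in members if j != i))
        for members in clusters
    ]


# Toy pairwise LD matrix for four SNPs (made-up values, for illustration only).
ld = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.8, 0.2],
    [0.6, 0.8, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

for method in ("nearest", "furthest"):
    groups = cluster_snps(ld, cutoff=0.7, linkage=method)
    print(method, groups, pick_representatives(ld, groups))
```

On this toy matrix with a cut-off of 0.7, nearest neighbour clustering chains SNPs 0, 1 and 2 into one cluster, while furthest neighbour clustering keeps SNP 2 separate because its Δ with SNP 0 is below the cut-off. Furthest neighbour clusters therefore give the more conservative (larger) SNP subset at the same cut-off.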
