Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies.

Large exploratory studies, including candidate-gene-association testing, genomewide linkage-disequilibrium scans, and array-expression experiments, are becoming increasingly common. A serious problem for such studies is that statistical power is compromised by the need to control the false-positive rate for a large family of tests. Because multiple true associations are anticipated, methods have been proposed that combine evidence from the most significant tests, as a more powerful alternative to individually adjusted tests. The practical application of these methods is currently limited by a reliance on permutation testing to account for the correlated nature of single-nucleotide polymorphism (SNP)-association data. On a genomewide scale, this is both very time-consuming and impractical for repeated explorations with standard marker panels. Here, we alleviate these problems by fitting analytic distributions to the empirical distribution of combined evidence. We fit extreme-value distributions for fixed lengths of combined evidence and a beta distribution for the most significant length. An initial phase of permutation sampling is required to fit these distributions, but it can be completed more quickly than a simple permutation test and need be done only once for each panel of tests, after which the fitted parameters give a reusable calibration of the panel. Even for a single analysis, our approach is a more efficient alternative to a standard permutation test. We demonstrate the accuracy of our approach and compare its efficiency with that of permutation tests on genomewide SNP data released by the International HapMap Consortium. The estimation of analytic distributions for combined evidence will allow these powerful methods to be applied more widely in large exploratory studies.
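As a rough illustration of the idea of fitting an analytic distribution to permutation replicates, the sketch below fits a Gumbel (type I extreme-value) distribution by the method of moments to null replicates of a combined-evidence statistic, then reads a significance level from the fitted upper tail. Several parts are assumptions made for illustration and are not taken from the abstract: the statistic is taken to be the sum of -log(p) over the k smallest p-values, the Gumbel form and the moment estimator are stand-ins for whichever extreme-value family and fitting method the authors used, and the null replicates are drawn as independent uniform p-values in place of real permutations of correlated SNP data. All function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def rank_truncated_stat(p_values, k):
    """Sum of -log(p) over the k smallest p-values (larger means more evidence)."""
    smallest = sorted(p_values)[:k]
    return sum(-math.log(p) for p in smallest)


def fit_gumbel_moments(samples):
    """Method-of-moments estimates (loc, scale) for a Gumbel distribution."""
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)
    scale = sd * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def gumbel_upper_tail(x, loc, scale):
    """P(X >= x) under the fitted Gumbel: 1 - exp(-exp(-(x - loc) / scale))."""
    z = (x - loc) / scale
    return -math.expm1(-math.exp(-z))


def simulate_null_stats(n_reps, n_tests, k, rng):
    """Stand-in for the permutation phase (assumption: independent uniform p-values)."""
    stats = []
    for _ in range(n_reps):
        # 1 - random() lies in (0, 1], so log() never receives zero.
        p_values = [1.0 - rng.random() for _ in range(n_tests)]
        stats.append(rank_truncated_stat(p_values, k))
    return stats


if __name__ == "__main__":
    rng = random.Random(0)
    null_stats = simulate_null_stats(n_reps=500, n_tests=1000, k=10, rng=rng)
    loc, scale = fit_gumbel_moments(null_stats)
    observed = max(null_stats)  # placeholder for a statistic from real data
    print("fitted loc, scale:", loc, scale)
    print("approximate significance level:", gumbel_upper_tail(observed, loc, scale))
```

Under these assumptions, the fitted pair (loc, scale) plays the role of the reusable calibration: once estimated for a panel, a new observed statistic can be converted to an approximate significance level without further permutation.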
