Adjustments for multiple testing are rare in situations where a modest number of markers are typed, such as candidate gene studies. Based on a simulation study, we argue that this can result in a considerable risk of false discoveries, but that a simple and efficient method can be used to counteract this.

We noted that in studies where relatively few tests are performed, such as candidate gene studies, adjustments for multiple testing are rare. The reasons are not clear. As few tests are performed compared to linkage (disequilibrium) scans, the risk of false discoveries is perhaps perceived to be small. Geneticists may also be reluctant to correct for multiple testing because this can result in low power to detect real effects. Furthermore, markers in candidate gene studies are often in linkage disequilibrium and analyzed in multiple ways (eg as single markers or as part of haplotypes). Procedures to control false discoveries are sometimes considered inappropriate for such correlated tests. The pattern that emerges from the analyses may sometimes seem meaningful. For instance, finding a more significant haplotype could be viewed as confirmation that a significant single marker was not a false discovery. Researchers might feel that methods for controlling false discoveries are too conservative because they cannot take this kind of information into account. The same might be true for other forms of 'circumstantial evidence', such as a significant result for markers that are functional or show similar genotype–phenotype associations in other species.

To study the validity of these reasons, we simulated data for 50 000 candidate gene studies based on coalescent processes (details at: http://www.vipbg.vcu.edu/~edwin/). In each study, there were five common SNPs (frequencies > 0.1), approximately evenly spaced in a 20 kb area. We tested (1) each SNP, (2) the overall haplotype distribution, (3) each common haplotype (frequency > 2%), and (4) each of the 10 two-marker haplotypes. This creates a multiple testing problem with correlated tests. For instance, with four common haplotypes the total number of tests equals 5 + 1 + 4 + 10 = 20. These tests are correlated because the SNPs are in linkage disequilibrium and the same SNP may be tested as a single marker or as part of a haplotype.

If all tests with P-values smaller than a critical value Pk are rejected, making no adjustment for multiple testing implies Pk = α. The Bonferroni correction, Pk = α/m, where m is the number of tests, guarantees that the proportion of studies yielding one or more false discoveries is at most α. Instead of controlling this study-wise error rate, it may be better to control the false discovery rate (FDR). The FDR equals the expected proportion of false discoveries among all significant tests. For example, FDR = 0.1 means that there will on average be one false discovery for every 10 significant findings. We studied three sequential P-value methods to control the FDR. All are easy to apply: the P-values are ordered and a simple rule then determines the significant tests (as sketched in the code below). If there are no real effects, the FDR control offered by these methods provides the same study-wise control as the Bonferroni correction. However, these procedures will be more powerful than the Bonferroni correction if there are true effects. Sequential P-value methods may be too liberal with a limited number of tests; we considered other methods that try to remedy this, but they did not work well.
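To make the 'order the P-values, then apply a simple rule' recipe concrete, here is a minimal sketch of the Benjamini–Hochberg step-up rule (the BH procedure discussed below) in Python. The function name and the use of NumPy are illustrative assumptions, not part of the original letter.

```python
import numpy as np

def bh_procedure(pvalues, q=0.05):
    """Benjamini-Hochberg step-up rule: order the m P-values, find the
    largest k with p_(k) <= (k/m) * q, and reject the k smallest ones.
    Returns a boolean mask over the input marking the rejected tests."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices that sort the P-values
    thresholds = q * np.arange(1, m + 1) / m   # increasing cutoffs k/m * q
    passing = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        reject[order[: passing[-1] + 1]] = True  # reject up to the largest k
    return reject

# Example: 20 P-values, as in the m = 5 + 1 + 4 + 10 = 20 setting above.
rng = np.random.default_rng(1)
print(bh_procedure(rng.uniform(size=20), q=0.10))
```

Note that the rule is 'sequential' in exactly the sense used above: the P-values are compared, in order, against a threshold that grows with the rank k.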
The so-called BH procedure controls the FDR at a pre-specified level q for uncorrelated and positively dependent test statistics. If the proportion p0 of tests that do not have effects is smaller than 1, the procedure is conservative and controls the FDR at level q × p0 rather than q. The adaptive procedure remedies this by estimating p0 and then running the BH procedure with q* = q/p0 (see the sketch below). Finally, the dependent procedure is known to be less powerful than the BH and adaptive procedures, but is valid under any dependency structure.

Table 1 shows that, without a correction for multiple testing, 24.4% of the studies will yield at least one significant result (α = 5%), whereas in reality there are no effects. Although a bit conservative, all other procedures did control the study-wise error rate at significance level α, despite the highly correlated tests. Least conservative are the BH and adaptive procedures. With the adaptive procedure, 70.9% of the tests were rejected in scenarios where one of the haplotypes carried a disease mutation. This was even better than the 66% of tests rejected when the significance level was not adjusted. The explanation is that, if there is a real effect, it will show in multiple tests because the same mutation can influence multiple single-marker and/or haplotype tests. Indeed, the estimated proportion of tests (1 − p0) with real effects …
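Continuing the sketch above (and reusing bh_procedure from it), both the adaptive and the dependent procedures reduce to running the BH rule at a modified level. The fixed-lambda estimator of p0 used here is an assumption for illustration; the letter's adaptive procedure may estimate p0 differently, so this shows the q* = q/p0 idea rather than the exact method.

```python
import numpy as np

def adaptive_bh(pvalues, q=0.05, lam=0.5):
    """Adaptive procedure: estimate p0, then run BH at q* = q / p0_hat.
    p0 is estimated here with a fixed-lambda estimator,
    p0_hat = #{p > lam} / (m * (1 - lam)); the letter's adaptive
    procedure may use a different p0 estimator, but the logic is the same."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    p0_hat = np.sum(p > lam) / (m * (1.0 - lam))
    p0_hat = min(1.0, max(p0_hat, 1.0 / m))   # keep the estimate in (0, 1]
    return bh_procedure(p, q / p0_hat)        # bh_procedure from the sketch above

def dependent_procedure(pvalues, q=0.05):
    """'Dependent' (Benjamini-Yekutieli) procedure: run BH at
    q / (1 + 1/2 + ... + 1/m), which is valid under any dependency."""
    m = len(pvalues)
    return bh_procedure(pvalues, q / np.sum(1.0 / np.arange(1, m + 1)))
```

Because q* = q/p0_hat ≥ q, the adaptive rule rejects at least as many tests as plain BH; when a real effect shows up in many correlated tests, the p0 estimate drops and the threshold loosens, which is consistent with the adaptive procedure outperforming even the unadjusted analysis in Table 1.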