Paper information - A statistical methodology for analyzing co-occurrence data from a large sample

A statistical methodology for analyzing co-occurrence data from a large sample

Determining important associations among items in a large database is challenging because many hypotheses are tested simultaneously and because weak associations can be selected that are statistically but not clinically significant. Simply applying the χ² test to all possible pairs of items yields mostly inappropriate associations that surpass the traditional threshold (α = 0.05, χ² = 3.84). One can choose a stricter threshold to find stronger associations, but the choice may be arbitrary. We combined the volume test of Diaconis and Efron with a p-value plot to select a more rigorous and less arbitrary threshold. The volume test adjusts the p-value of the χ² statistic. A plot of the adjusted p-values (1 - p versus N(p), where N(p) is the number of test statistics with a p-value greater than p) should be linear if there are no true associations. The point where the plot deviates from a line can be used as a threshold. We used linear regression to select the threshold in a reproducible fashion. In one experiment, the method selected a threshold similar to one previously obtained by manually reviewing associations.
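The p-value plot step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses unadjusted 1-degree-of-freedom χ² p-values rather than the volume-test adjustment, and the function names, the fitting region (`fit_from`), and the deviation rule (`min_excess`) are hypothetical choices made only for this example, since the abstract does not specify the regression rule.

```python
import bisect
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 co-occurrence table [[a, b], [c, d]] of one item pair."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def pvalue_plot(pvalues):
    """Return (p, N(p)) pairs sorted by p, where N(p) counts p-values > p.
    The plot itself puts 1 - p on the x-axis and N(p) on the y-axis."""
    ps = sorted(pvalues)
    return [(p, len(ps) - bisect.bisect_right(ps, p)) for p in ps]


def select_threshold(pvalues, fit_from=0.5, min_excess=5.0):
    """Fit a least-squares line to the plot where p >= fit_from (assumed to
    be dominated by null pairs), then walk toward small p and return the
    first p whose N(p) lies more than min_excess above the line.
    Both parameters are assumptions of this sketch."""
    points = pvalue_plot(pvalues)
    fit = [(1.0 - p, n) for p, n in points if p >= fit_from]
    if len(fit) < 2:
        raise ValueError("need at least two p-values >= fit_from")
    mean_x = sum(x for x, _ in fit) / len(fit)
    mean_y = sum(y for _, y in fit) / len(fit)
    sxx = sum((x - mean_x) ** 2 for x, _ in fit)
    if sxx == 0.0:
        raise ValueError("fitting region has no spread in p")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in fit) / sxx
    intercept = mean_y - slope * mean_x
    for p, n in reversed(points):  # from large p to small p
        if p < fit_from and n - (slope * (1.0 - p) + intercept) > min_excess:
            return p, slope, intercept
    return None, slope, intercept  # the plot never leaves the line


# Synthetic illustration: 1000 evenly spaced "null" p-values plus
# 50 very small p-values standing in for true associations.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
assoc_p = [1e-6 * (j + 1) for j in range(50)]
threshold, slope, intercept = select_threshold(null_p + assoc_p)
print("threshold:", threshold, "estimated number of null pairs:", round(slope))
```

The slope of the fitted line serves as a rough estimate of the number of pairs with no true association, which is why the excess of N(p) over the line at small p signals real associations. In practice `min_excess` would need to reflect the sampling noise of the plot; a fixed count is used here only to keep the sketch short.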

[1] George Hripcsak, et al. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics, 2005, AMIA.

[2] M. Newton. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis, 2008.

[3] B. Efron. Large-Scale Simultaneous Hypothesis Testing, 2004.

[4] E. Spjøtvoll, et al. Plots of P-values to evaluate many tests simultaneously, 1982.

[5] W. G. Cochran. Some Methods for Strengthening the Common χ² Tests, 1954.

[6] F. Yates. Contingency Tables Involving Small Numbers and the χ² Test, 1934.

[7] P. Diaconis, et al. Testing for independence in a two-way table, 1985.