There is growing concern in the scientific community that many published scientific findings represent spurious patterns that are not reproducible in independent data sets. One reason is that significance tests or confidence intervals are often applied to secondary variables or sub-samples within a trial, in addition to the primary hypotheses (the problem of multiple hypotheses). This problem is likely to be extensive for population-based surveys, in which epidemiological hypotheses are derived after seeing the data set (hypothesis fishing). We recommend a data-splitting procedure to counteract this methodological problem, in which one part of the data set is used for identifying hypotheses and the other is used for hypothesis testing. The procedure is similar to two-stage analysis of microarray data. We illustrate the process using a real data set on predictors of low back pain at 14-year follow-up in a population initially free of low back pain. “Widespreadness” of pain (pain reported in several places other than the low back) was a statistically significant predictor, while smoking was not, despite its strong association with low back pain in the first half of the data set. We argue that the application of data splitting, in which an independent party handles the data set, will achieve for epidemiological surveys what pre-registration has done for clinical studies.
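To make the recommended procedure concrete, the following is a minimal sketch in Python, assuming a pandas data frame of binary survey variables. The column names (smoking, pain_widespreadness, low_back_pain), the liberal screening threshold, and the use of chi-squared tests with a Bonferroni correction are illustrative assumptions, not details taken from the original study.

```python
# Minimal sketch of a data-splitting analysis: hypotheses are generated on one
# random half of the data and formally tested only on the other half.
# Column names, thresholds, and tests are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy import stats


def split_explore_confirm(df: pd.DataFrame, seed: int = 0):
    """Randomly split the survey data into an exploration half
    (hypothesis generation) and a confirmation half (hypothesis testing)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    half = len(df) // 2
    return df.iloc[idx[:half]], df.iloc[idx[half:]]


def screen_predictors(explore: pd.DataFrame, outcome: str, candidates, alpha=0.20):
    """On the exploration half, keep candidate predictors whose association
    with the outcome looks promising (liberal threshold, purely exploratory)."""
    selected = []
    for col in candidates:
        table = pd.crosstab(explore[col], explore[outcome])
        _, p, _, _ = stats.chi2_contingency(table)
        if p < alpha:
            selected.append(col)
    return selected


def confirm_predictors(confirm: pd.DataFrame, outcome: str, selected, alpha=0.05):
    """On the confirmation half, test only the pre-selected hypotheses;
    a Bonferroni correction guards the family-wise error rate."""
    adjusted_alpha = alpha / max(len(selected), 1)
    results = {}
    for col in selected:
        table = pd.crosstab(confirm[col], confirm[outcome])
        _, p, _, _ = stats.chi2_contingency(table)
        results[col] = (p, p < adjusted_alpha)
    return results


# Example usage with a hypothetical survey data frame `survey`:
# explore, confirm = split_explore_confirm(survey)
# selected = screen_predictors(explore, "low_back_pain",
#                              ["smoking", "pain_widespreadness", "bmi_high"])
# print(confirm_predictors(confirm, "low_back_pain", selected))
```

Because only the predictors selected on the exploration half are carried forward, the confirmation-half p-values are not distorted by the exploratory search, which is the essence of the data-splitting safeguard described above.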