Fishing Expedition Probability: The Statistics of Post Hoc Hypothesizing

The advent of the computer and mechanical data processing has greatly expanded the number of relationships the social scientist may now test with his data. In the pre-machine age of yesteryear, it would take all morning to sort out one cross-break or to calculate one correlation coefficient. Now, computers will churn out scores of cross-breaks or correlations in the twinkling of an eye. Beneficial as such an expanded volume of output may be, it also requires certain readjustments in our research perspectives.

In the old days a researcher did not compute a correlation between two variables unless he had a good hunch that there would or would not be a relationship. That is, he pretty much had to have an hypothesis a priori. The old-fashioned techniques compelled the researcher, in the best tradition of textbook scientific method, to place his bets before he ran the test. He had to bet a morning or a day of his time that a relationship existed (or that a relationship proposed by someone else did not exist).

With computers, the situation changes entirely. Once data are fed in, there is virtually no restraint on, and no sacrifice in, exploring dozens of hypotheses in addition to the one the researcher was initially curious about. Indeed, the researcher would often have to go out of his way to prevent the machine from pumping out many additional computations. As a result, we now end up "testing" scores of hypotheses in a post hoc fashion. We place no bets beforehand; we simply scan the print-out and then discover relationships.

This approach is not, as I shall point out later, as pernicious as it perhaps sounds; but it does compel an elementary readjustment, summed up in the following observation: The meaning of tests of statistical significance in exploratory research is radically altered when many relationships are simultaneously examined. Two hypothetical, but quite typical, examples serve to illustrate the problem.
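As a preliminary to those examples, the arithmetic behind the observation can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text above: the tests are treated as statistically independent, every null hypothesis is taken to be true, and the conventional .05 level is used. The function names are hypothetical.

```python
def chance_of_spurious_finding(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` tests is 'significant'
    at level `alpha` when no real relationship exists anywhere.

    Assumes the tests are independent, which print-outs of many
    correlations from one data set usually are not.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_spurious_findings(num_tests: int, alpha: float = 0.05) -> float:
    """Expected count of 'significant' results arising by chance alone.

    This expectation holds whether or not the tests are independent.
    """
    return num_tests * alpha


if __name__ == "__main__":
    # Tabulate both quantities for a few illustrative print-out sizes.
    for k in (1, 10, 20, 50, 100):
        p_any = chance_of_spurious_finding(k)
        n_exp = expected_spurious_findings(k)
        print(f"{k:4d} tests: P(at least one) = {p_any:.3f}, "
              f"expected chance findings = {n_exp:.1f}")
```

Under these assumptions, a researcher who scans twenty relationships has about a .64 chance of finding at least one that is "significant" at the .05 level even when none is real. Where the tests are correlated, as they usually are within a single data set, the exact probability differs, but the expected number of chance findings is unchanged.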