Meta-analyzing the multiverse: A peek under the hood of selective reporting.

There are arbitrary decisions to be made (i.e., researcher degrees of freedom) in the execution and reporting of most research. These decisions allow for many possible outcomes from a single study. Selective reporting of results from this ‘multiverse’ of outcomes, whether intentional (_p_-hacking) or not, can lead to inflated effect size estimates and false positive results in the literature. In this study, we examine and illustrate the consequences of researcher degrees of freedom in primary research, both for primary outcomes and for subsequent meta-analyses. We used a set of 10 preregistered multi-lab direct replication projects from psychology (Registered Replication Reports) with a total of 14 primary outcome variables, 236 labs, and 37,602 participants. By exploiting researcher degrees of freedom in each project, we were able to compute between 3,840 and 2,621,440 outcomes per lab. We show that researcher degrees of freedom in primary research can cause substantial variability in effect size estimates, which we denote the Underlying Multiverse Variability (UMV). In our data, the median UMV across labs was 0.1 standard deviations (interquartile range = 0.09–0.15). In one extreme case, the effect size estimate could change by _d_ = 1.27, showing that _p_-hacking can, in some (rare) cases, provide support for almost any conclusion. We also show that researcher degrees of freedom in primary research add a source of uncertainty to meta-analysis beyond those usually estimated. This would be of little concern for meta-analysis if researchers made all arbitrary decisions at random. However, emulating selective reporting of lab results inflated meta-analytic average effect size estimates in our data by as much as 0.1–0.48 standard deviations, depending largely on the number of possible outcomes at the lab level (i.e., multiverse size).
Our results illustrate the importance of making research decisions transparent (e.g., through preregistration and multiverse analysis), of evaluating studies for selective reporting, and, whenever feasible, of making raw data available.
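To make the kind of computation described above concrete, the sketch below enumerates a small hypothetical multiverse for a single simulated lab and summarizes its spread. All decision rules (outlier cutoffs, trimming), the simulated data, and the choice to summarize UMV as the range between the largest and smallest Cohen's _d_ are illustrative assumptions, not the projects' actual analysis pipelines.

```python
# Illustrative sketch (not the paper's pipeline): enumerate a per-lab
# "multiverse" of effect sizes and summarize its Underlying Multiverse
# Variability (UMV), here taken as the max-min range of Cohen's d.
import itertools
import numpy as np

rng = np.random.default_rng(0)
treatment = rng.normal(0.3, 1.0, 100)  # simulated lab data, true d ≈ 0.3
control = rng.normal(0.0, 1.0, 100)

def cohens_d(a, b):
    """Standardized mean difference with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def apply_choices(x, z_cutoff, trim):
    """One arbitrary analysis path: optional z-score exclusion, optional trimming."""
    if z_cutoff is not None:
        z = (x - x.mean()) / x.std(ddof=1)
        x = x[np.abs(z) <= z_cutoff]
    if trim:
        x = np.sort(x)[2:-2]  # drop the 2 most extreme values per tail
    return x

# Cross all decision options to enumerate the multiverse of outcomes
z_cutoffs = [None, 2.0, 2.5, 3.0]
trims = [False, True]
multiverse = [
    cohens_d(apply_choices(treatment, z, t), apply_choices(control, z, t))
    for z, t in itertools.product(z_cutoffs, trims)
]

umv = max(multiverse) - min(multiverse)  # spread due to arbitrary choices alone
print(f"{len(multiverse)} outcomes, UMV = {umv:.3f}")
```

Crossing two binary-or-multi-valued decisions already yields 8 outcomes from one dataset; the projects' real decision sets multiply to the thousands-to-millions of outcomes per lab reported above. A selective reporter picking `max(multiverse)` rather than a random element is the mechanism by which the meta-analytic average becomes inflated.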