Can I Take a Peek?: Continuous Monitoring of Online A/B Tests

A/B testing is a hallmark of Internet services: from e-commerce sites to social networks to marketplaces, nearly all online services use randomized experiments as a mechanism to make better business decisions. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals. Despite their ubiquity, these reported values are computed under the assumption that the experimenter will not continuously monitor their test--in other words, there should be no repeated "peeking" at the results that affects the decision of whether to continue the test. On the other hand, one of the greatest benefits of advances in information technology, computational power, and visualization is precisely the fact that experimenters can watch experiments in progress, with greater granularity and insight over time than ever before. We ask the question: if users will continuously monitor experiments, then what statistical methodology is appropriate for hypothesis testing, significance, and confidence intervals? We present recent work addressing this question. In particular, building from results in sequential hypothesis testing, we present analogues of classical frequentist statistical measures that are valid even though users are continuously monitoring the results.