Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix

Controlled experiments are widely regarded as the most scientific way to establish a true causal relationship between product changes and their impact on business metrics. Many technology companies rely on such experiments as their main data-driven decision-making tool. The sensitivity of a controlled experiment refers to its ability to detect differences in business metrics due to product changes. At Netflix, with tens of millions of users, increasing the sensitivity of controlled experiments is critical: failure to detect a small effect, whether positive or negative, can have a substantial revenue impact. This paper focuses on methods to increase sensitivity by reducing the sampling variance of business metrics. We define Netflix business metrics and provide context on the critical need for improved sensitivity. We review popular variance reduction techniques that are broadly applicable to any type of controlled experiment and metric. We describe an innovative implementation of stratified sampling at Netflix, in which users are assigned to experiments in real time, and discuss some surprising challenges we encountered along the way. We conduct case studies comparing these variance reduction techniques on several Netflix datasets. Based on the empirical results, we recommend using post-assignment variance reduction techniques such as post-stratification and CUPED rather than at-assignment techniques such as stratified sampling in large-scale controlled experiments.
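
To make the abstract's recommendation concrete, the sketch below illustrates the two post-assignment techniques it names, CUPED and post-stratification, on synthetic data. This is a minimal illustration, not the paper's implementation: the function names `cuped_adjust` and `post_stratified_mean`, the simulated covariate, and the quartile strata are all hypothetical choices for the example. In CUPED, a pre-experiment covariate X is used to adjust the in-experiment metric Y via Y' = Y - θ(X - X̄) with θ = cov(X, Y)/var(X); in post-stratification, per-stratum means are reweighted by known stratum proportions after assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: x is a pre-experiment covariate
# (e.g., a user's viewing hours before assignment) and y is the
# in-experiment metric, deliberately correlated with x.
n = 100_000
x = rng.gamma(shape=2.0, scale=5.0, size=n)   # pre-period metric
y = 0.8 * x + rng.normal(0.0, 4.0, size=n)    # in-experiment metric


def cuped_adjust(y, x):
    """CUPED adjustment: y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x)."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())


def post_stratified_mean(y, strata, weights):
    """Post-stratified estimate: weighted average of per-stratum
    means, using known population stratum weights."""
    return sum(w * y[strata == s].mean() for s, w in weights.items())


# Variance of the naive sample mean vs. the CUPED-adjusted mean.
y_adj = cuped_adjust(y, x)
print("naive var of mean:", y.var(ddof=1) / n)
print("CUPED var of mean:", y_adj.var(ddof=1) / n)

# Post-stratification on a coarse binning of x: strata are the
# covariate's quartiles, each with known population share 0.25.
strata = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))
weights = {s: 0.25 for s in range(4)}
print("post-stratified mean:", post_stratified_mean(y, strata, weights))
```

Because the pre-experiment covariate is strongly correlated with the metric here, the CUPED-adjusted mean has variance reduced by a factor of roughly 1 - ρ² (about one third of the naive variance in this simulation), which is the kind of sensitivity gain the paper studies on real Netflix data. In an actual experiment, θ would be estimated from pre-experiment data pooled across arms and the adjusted metric compared between treatment and control.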