Online search systems that display ads continually offer new features that advertisers can use to fine-tune and enhance their ad campaigns. An important question is whether a new feature actually helps advertisers. In an ideal world for statisticians, we would answer this question by running a statistically designed experiment. But that would require randomly choosing a set of advertisers and forcing them to use the feature, which is not realistic. Accordingly, in the real world, new features for advertisers are seldom evaluated with a traditional experimental protocol. Instead, customer service representatives select advertisers who are invited to be among the first to test a new feature (i.e., they are whitelisted), and then each whitelisted advertiser chooses whether or not to use the new feature. Neither the customer service representatives nor the advertisers choose at random.
This paper addresses the problem of drawing valid inferences from whitelist trials about the effects of new features on advertiser happiness. We are guided by three principles. First, statistical procedures for whitelist trials are likely to be applied in an automated way, so they should be robust to violations of modeling assumptions. Second, standard analysis tools should be preferred over custom-built ones, both for clarity and for robustness; standard tools have withstood the test of time and have been thoroughly debugged. Finally, it should be easy to compute reliable confidence intervals for the estimator. We review an estimator that has all of these attributes and use it to draw valid inferences about the effect of a new feature on advertiser happiness. In the example presented in this paper, the new feature was introduced during the holiday shopping season, which further complicated the analysis.
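The abstract does not spell out the estimator itself. Purely as an illustration of the stated principles (standard tools, robustness to model misspecification, easy confidence intervals), the sketch below shows one common propensity-score-based approach: an augmented inverse-propensity-weighted (doubly robust) estimate of the effect of feature adoption, with a percentile-bootstrap confidence interval. The variable names, covariates, and modeling choices here are assumptions made for the example, not details taken from the paper.

```python
# Illustrative sketch only: doubly robust (AIPW) estimate of the effect of
# adopting a new feature on an advertiser outcome, with a bootstrap CI.
# Inputs are assumed to be numpy arrays: X (advertiser covariates),
# treated (0/1 feature adoption), y (a "happiness" metric).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression


def aipw_effect(X, treated, y):
    """Augmented inverse-propensity-weighted estimate of the average effect."""
    # Propensity model: probability that an advertiser adopts the feature.
    e = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    e = np.clip(e, 0.01, 0.99)  # guard against extreme weights

    # Outcome models fit separately on adopters and non-adopters.
    mu1 = LinearRegression().fit(X[treated == 1], y[treated == 1]).predict(X)
    mu0 = LinearRegression().fit(X[treated == 0], y[treated == 0]).predict(X)

    # AIPW combines outcome-model predictions with inverse-propensity corrections;
    # the estimate is consistent if either model is correctly specified.
    dr1 = mu1 + treated * (y - mu1) / e
    dr0 = mu0 + (1 - treated) * (y - mu0) / (1 - e)
    return np.mean(dr1 - dr0)


def bootstrap_ci(X, treated, y, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample advertisers with replacement
        draws.append(aipw_effect(X[idx], treated[idx], y[idx]))
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return aipw_effect(X, treated, y), (lo, hi)
```

Resampling whole advertisers keeps the confidence interval computation simple and standard, in line with the third principle; more refined variance estimates would be needed if, for example, the holiday season induced strong time effects.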