On principles for modeling propensity scores in medical research

It is clearly important to document how a proposed statistical methodology is actually used in practice if that practice is to be improved. This target article by Weitzen et al., reviewing the way propensity score methods are used in current medical research, is therefore an important contribution, and I am delighted to have been invited by the editorial board to discuss it. I am a firm believer in the utility of propensity scores in observational studies for causal effects, not as a panacea for the deficiencies of observational studies, but as a critical tool contributing to their appropriate design. The target article reveals that many published articles employing propensity scores for medical research are not taking full advantage of the technology, and some may even be misusing it. A possible reason, as indicated by the authors, may be confusion between two kinds of statistical diagnostics: (i) diagnostics for the successful prediction of probabilities and parameter estimates underlying those probabilities, possibly estimated using logistic regression; and (ii) diagnostics for the successful design of observational studies based on estimated propensity scores, possibly estimated using logistic regression. There is no doubt in my mind that (ii) is a critically important activity in most observational studies, whereas I am doubtful about the importance of (i) in most of these.

At the outset, it is essential to realize that observational studies should be designed in analogy with the way randomized experiments are designed. This is a theme that can be traced to classical work in observational studies, and a theme I have recently emphasized in the context of the tobacco litigation. When we design a randomized experiment, we cannot try one randomization and see the answer, then try another randomization and see the answer, and continue until we find an answer that is ‘satisfactory’ for publication.
Randomized experiments are designed blind to the answer, and this is one of the most important features of randomized experiments. It is a feature that can also be shared with observational studies, although sadly, observational studies are often not conducted this way.

Randomized experiments are designed to have balance between treatment and control groups, often within blocks (i.e. within strata, subclasses or matched pairs), on all covariates. Blocking assures balance on the observed covariates used to create the blocks, and randomization implies balance (at least on average) on all other covariates, both observed and unobserved. Due to the absence of randomization in observational studies, we cannot force balance on unobserved covariates, but we must attempt to balance the observed ones (at least on average), and propensity score technology, often combined with blocking on especially important covariates, is an important tool for achieving this balance in observed covariates. In this sense, propensity score technology is the observational study analog of randomization in experiments; randomization is superior in a critical way, however, because it achieves this average balance on all covariates, both observed and unobserved, whereas propensity score methods only operate on observed covariates.

If this balance is achieved in an observational study, that is, if the treatment and control groups have very similar distributions of the observed covariates within blocks (subclasses, matched pairs, etc.) of the propensity score (perhaps crossed by blocks on critical covariates), then it really makes no difference, for estimation of effects controlling for these covariates, how this balance was achieved. Within blocks balanced on propensity scores, future model-based adjustments for distributional differences between treatment and control groups (e.g. using linear covariance, relative risks, proportional hazards) will typically have only minor effects on point estimates, although they can have important effects on estimated precisions and, therefore, on interval estimates.
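
The design-stage check described above, diagnostic (ii), can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the target article: the data are simulated, the propensity score is estimated by a plain Newton-Raphson logistic fit, the blocks are quintiles of the estimated score, and the function names (`simulate`, `propensity_scores`, `blocked_standardized_differences`) are hypothetical. No outcome variable appears anywhere in the code, in keeping with designing the study blind to the answer.

```python
import numpy as np


def simulate(n=2000, seed=0):
    """Simulated covariates and treatment indicator (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]  # treatment depends on X0 and X1 only
    z = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, z


def propensity_scores(X, z, n_iter=25):
    """Estimate P(z = 1 | X) by logistic regression fitted with Newton-Raphson."""
    A = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (z - p)
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        beta = beta + np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-A @ beta))


def pooled_sd(X, z):
    """Per-covariate SD pooled over the full treated and control groups."""
    return np.sqrt((X[z == 1].var(axis=0, ddof=1) + X[z == 0].var(axis=0, ddof=1)) / 2.0)


def raw_standardized_differences(X, z):
    """Treated-minus-control mean difference per covariate, in pooled-SD units."""
    return (X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)) / pooled_sd(X, z)


def blocked_standardized_differences(X, z, scores, n_blocks=5):
    """Same difference computed within quantile blocks of the estimated score,
    then averaged over blocks with weights proportional to block size."""
    cuts = np.quantile(scores, np.linspace(0, 1, n_blocks + 1)[1:-1])
    block = np.searchsorted(cuts, scores, side="right")  # labels 0 .. n_blocks - 1
    diff = np.zeros(X.shape[1])
    for b in range(n_blocks):
        in_b = block == b
        treated, control = in_b & (z == 1), in_b & (z == 0)
        if not treated.any() or not control.any():
            raise ValueError(f"block {b} lacks treated or control units")
        diff += in_b.mean() * (X[treated].mean(axis=0) - X[control].mean(axis=0))
    return diff / pooled_sd(X, z)


if __name__ == "__main__":
    X, z = simulate()
    scores = propensity_scores(X, z)
    print("raw standardized differences:    ", raw_standardized_differences(X, z))
    print("blocked standardized differences:", blocked_standardized_differences(X, z, scores))
```

The quantity reported is a balance diagnostic on the observed covariates, not a goodness-of-fit measure for the logistic model. If the blocked differences remain large, the natural response under this sketch would be to revise the score model or the blocking and check balance again, still without looking at outcomes.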