Response to comment on 'Empirical assessment of methods for risk identification in healthcare data'.

The secondary use of observational healthcare data has the potential to support the characterization of causal associations between medical product exposures and subsequent health outcomes of interest. These data are often used in pharmacoepidemiology studies to estimate the strength of a temporal association as an average treatment effect. The method used within a pharmacoepidemiology study can be considered a ‘measurement device’. As with any measurement device, it is critical to first understand its operating characteristics (how well it works and whether it is properly calibrated for the objective at hand) before deploying it. This understanding should be considered a prerequisite whether the data and methods are used to study prespecified hypotheses concerning a single drug–outcome pair or are applied more systematically to multiple drug–outcome pairs, as in the proactive identification of potential associations. In this regard, we see the analytical challenges of ‘signal generation’ and ‘signal refinement’ as the same, and the need to establish operating characteristics as highly applicable to both use cases that Gagne and Schneeweiss articulated [1].

Gagne and Schneeweiss [1] focused on the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a quantitative method for characterizing epidemiological analysis methods. Indeed, there are various metrics for measuring the operating characteristics of analysis methods, each of which can contribute to and complement our understanding of the methods. The critical notion, however, is that our understanding should be based on empirical assessments of operating characteristics, which can be obtained by using a set of test cases, assuming exchangeability among the test cases. These test cases can be examined in real and simulated data, and the exchangeability assumption can be assessed through stratification of the test cases (e.g., partitioning by outcome).
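To make the stratified assessment concrete, the sketch below (with entirely hypothetical drug names, outcomes, and effect estimates, not values from any study) partitions a set of test cases by outcome so that operating characteristics can be examined within each stratum:

```python
# Sketch: assessing the exchangeability assumption by partitioning
# test cases into strata (here, by outcome). All test cases below are
# hypothetical illustrations.
from collections import defaultdict

test_cases = [
    # (drug, outcome, is_positive_control, estimated_rr)
    ("drugA", "GI bleed", True,  2.1),
    ("drugB", "GI bleed", False, 1.5),
    ("drugC", "acute MI", True,  1.2),
    ("drugD", "acute MI", False, 0.9),
]

# Group test cases by outcome; performance metrics can then be
# computed separately within each stratum and compared across strata.
strata = defaultdict(list)
for drug, outcome, is_positive, rr in test_cases:
    strata[outcome].append((is_positive, rr))

for outcome, cases in strata.items():
    print(outcome, cases)
```

If a method's operating characteristics differ markedly across strata, the assumption that test cases are exchangeable is called into question for that method.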
Once these empirical experiments have been conducted, each of the performance metrics can be computed in a straightforward manner. The AUC measures the ability of an analytic approach to discriminate between positive and negative controls. AUC depends only on the rank order of the measure being assessed (estimated effect sizes in our case) across the test cases. To calculate AUC, we do not need to know the true effect size of each test case; we only have to assume that the effect size for positive controls is greater than that for negative controls. In this regard, it is a conservative metric: methods that fail to discriminate positive effects from non-effects should be expected to be even less able to discriminate between different non-null effect sizes. Imperfect discrimination means that one or more effect estimates for negative controls are larger than one or more effect estimates for positive controls. Understanding how often this lack of discrimination occurs is valuable when considering how much confidence to place in any one result. The observation that no method achieved an AUC above 0.77 in this experiment [2] suggests that a single estimate from any of these methods has the potential to be misleading when used to determine whether a drug–outcome relationship represents a positive association. Accurately determining the magnitude of the association becomes a second-order issue if the existence or absence of an effect cannot be confidently established. Although an analytical method may discriminate negative from positive controls well on the basis of inaccurate relative risks (RRs), it is not possible for a method to produce accurate RRs without accurately discriminating negative from positive controls. As Gagne and Schneeweiss pointed out, analytical methods may produce biased results in observational healthcare data, and the magnitude of bias may vary across methods.
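The rank-order property of AUC can be sketched directly: AUC is the probability that a randomly chosen positive control receives a larger effect estimate than a randomly chosen negative control. The effect estimates below are hypothetical, not results from the cited experiment:

```python
# Sketch: AUC as rank-order discrimination between positive and
# negative controls. Estimates below are hypothetical illustrations.
def auc(positive_estimates, negative_estimates):
    """Probability that a randomly chosen positive control receives a
    larger effect estimate than a randomly chosen negative control
    (ties count as one half)."""
    wins = 0.0
    for p in positive_estimates:
        for n in negative_estimates:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_estimates) * len(negative_estimates))

pos = [2.1, 1.8, 1.2, 3.0]  # positive controls (true effect assumed > 1)
neg = [0.9, 1.5, 1.1, 0.8]  # negative controls (true RR assumed = 1)
print(auc(pos, neg))  # → 0.9375
```

Note that because only ranks matter, any monotone transformation of the estimates (e.g., moving to the log scale) leaves the AUC unchanged; this is why AUC can be computed without knowing the true effect sizes.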
Bias is the expected value of the error distribution, where error represents the difference between the estimated effect and the true effect. A key challenge in estimating bias in pharmacoepidemiology use cases is that the true effect size of a drug–outcome relationship is unknown. Here, when evaluating the results of real data experiments, we can assume RRtrue = 1 for the negative controls, although the validity of that assumption cannot be assessed. The true effect size for positive controls cannot be readily established; even in situations