In clinical survival studies conducted in the United States, rich data are frequently available on variables recording time to dropout and the evolution over time of a patient's clinical signs, symptoms, and laboratory measurements. Motivated by a Greek study, Frangakis and Rubin (FR) consider estimation of a marginal survival curve under a double sampling design from severely impoverished reduced data that included none of the aforementioned variables due to confidentiality restrictions. To make our discussion relevant to settings with and without confidentiality restrictions, we shall consider survival curve estimation both from rich data that include the aforementioned variables and from reduced data that do not. Let T, L, and C be continuous failure, dropout, and administrative censoring times, respectively, with time measured from date of enrollment. The goal is to estimate the cumulative net (marginal) hazard of failure $\Lambda(t) = \int_0^t \lambda_T(u)\,du$ and the survival function $S(t) = e^{-\Lambda(t)}$ under a double sampling design in which a subset of the dropouts is followed up in a second-phase sample. In their analysis, FR assumed (a) all dropouts had the same chance of being selected into the second-phase sample, (b) all dropouts selected into the second phase had their censoring indicator $\Delta = I(T < C)$ and their minimum X = min(T, C) of censoring and failure successfully ascertained, and (c) C was independent of underlying variables such as T and L. FR needed to impose these assumptions because their estimator of $\Lambda(t)$ is inconsistent unless (a)-(c) hold. In practice, one or more of assumptions (a)-(c) may often fail to hold. For instance, in the first paragraph of their Section 5, FR noted that a subset of the dropouts pursued in the second phase will often fail to have $(X, \Delta)$ ascertained, violating assumption (b). FR recommend that members of this subset be treated in the analysis as having been administratively censored. 
However, were this recommendation to be followed, FR's survival estimator would be inconsistent and could be severely biased if the number of second-phase subjects who do not have $(X, \Delta)$ ascertained is large. Assumption (a) will be false if a potentially more efficient design has been employed in which subjects who dropped out early are oversampled in the second phase. Assumption (c) will be false when there are secular trends in the distribution of T, as was the case during the 1980s and 1990s for the survival time T of AIDS patients. If only assumption (a) were false, then, as FR note, identification of the cumulative hazard $\Lambda(t)$ could be restored by addition of the known second-phase sampling probabilities to FR's reduced data. However, when assumptions (b) and/or (c) are false, both additional assumptions and rich data are required to restore identification. In this discussion, we make the following points. First, we show in Section 4 that, if assumptions (a)-(c) hold, then (i) when only the reduced data are available for analysis, FR's estimator is algebraically identical to the efficient inverse probability weighted (IPW) estimator, but (ii) when rich data are available, FR's estimator is inefficient; for this case, we provide in Section 8 a locally semiparametric efficient (LSE) estimator of S(t) that exploits the information in the rich data. Second, we consider the more realistic setting in which (i) FR's estimator is inconsistent because one or more of assumptions (a)-(c) fail to hold and (ii) rich data are available. In Sections 6 and 7, we derive doubly robust LSE survival curve estimators under the assumption that the missingness process is ignorable. 
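To make the inverse probability weighting idea concrete, the following is a minimal sketch of one simple IPW construction: a weighted Nelson-Aalen estimate of the cumulative hazard in which non-dropouts receive weight 1, dropouts followed up in the second phase receive weight equal to the inverse of their known selection probability, and dropouts not followed up receive weight 0. It is an illustration only, not FR's estimator and not the LSE estimators derived in later sections; it assumes that the selection probabilities are known, that every followed-up dropout has min(T, C) and the failure indicator ascertained, and that C is independent of T. The function name `ipw_nelson_aalen`, its argument names, and the toy data are hypothetical.

```python
import numpy as np

def ipw_nelson_aalen(x, delta, dropout, selected, pi):
    """Weighted Nelson-Aalen estimate under double sampling (illustrative).

    x        : observed time min(T, C); may be NaN for dropouts not followed up
    delta    : 1 if failure was observed (T < C), else 0
    dropout  : True if the subject dropped out in the first phase
    selected : True if a dropout was followed up in the second phase
    pi       : known second-phase selection probability (read for dropouts only)
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    dropout = np.asarray(dropout, dtype=bool)
    selected = np.asarray(selected, dtype=bool)
    pi = np.asarray(pi, dtype=float)

    # Weight 1 for non-dropouts, 1/pi for followed-up dropouts, 0 otherwise.
    w = np.ones_like(x)
    followed = dropout & selected
    w[followed] = 1.0 / pi[followed]
    w[dropout & ~selected] = 0.0

    # Keep only subjects who contribute, so unknown times are never compared.
    keep = w > 0
    x, delta, w = x[keep], delta[keep], w[keep]

    # At each distinct failure time: weighted failures / weighted number at risk.
    event_times = np.unique(x[delta == 1])
    increments = np.array([
        w[(x == u) & (delta == 1)].sum() / w[x >= u].sum()
        for u in event_times
    ])
    cum_hazard = np.cumsum(increments)
    return event_times, cum_hazard, np.exp(-cum_hazard)

# Toy data: two non-dropouts, one followed-up dropout, one dropout not followed up.
times, cum_hazard, surv = ipw_nelson_aalen(
    x=[2.0, 5.0, 3.0, np.nan],
    delta=[1, 0, 1, 0],
    dropout=[False, False, True, True],
    selected=[False, False, True, False],
    pi=[1.0, 1.0, 0.5, 0.5],
)
print(times, cum_hazard, surv)
```

When dropouts who left early are oversampled, so that assumption (a) fails, the same sketch applies with subject-specific values of `pi`, provided those probabilities are known by design.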
In summary, we describe aspects of a powerful methodology for the analysis of doubly sampled censored survival data that can resolve the three principal problems left open by FR: (i) how to construct LSE estimators of S(t) when FR's assumptions (a)-(c) hold and, as would typically be the case in the United States, rich data are available for analysis; (ii) how to construct LSE doubly robust estimators of S(t) when FR's assumptions (b) and/or (c) fail but missingness remains ignorable; and (iii) how to conduct a sensitivity analysis when missingness may be nonignorable. This methodology is in effect a subset of the general theory developed in Robins and Rotnitzky (1992), Robins (1993a,b), Robins, Rotnitzky, and Scharfstein (1999), and Scharfstein, Rotnitzky, and Robins (1999a) for the analysis of semi- and nonparametric right-censored data models, specialized to the case of doubly sampled censored survival data. Because of space limitations, our resolution of problem (iii) will be described elsewhere. To exploit our general theory in the context of doubly sampled data, an additional problem must be faced. Specifically, in this context, the distribution of the censoring variable has both discrete and continuous components; as a result, none of the estimators of S(t) previously proposed in the aforementioned papers is directly applicable. In Section 6 and the Appendix, we provide a survival estimator that allows for a mixed discrete and continuous censor-
[1] D. Rubin et al. Ignorability and Coarse Data. 1991.
[2] D. Scharfstein et al. Inference in Randomized Studies with Informative Censoring and Discrete Time-to-Event Endpoints. Biometrics, 2001.
[3] James M. Robins et al. On Profile Likelihood: Comment. 2000.
[4] J. Robins et al. Semiparametric regression estimation in the presence of dependent censoring. 1995.
[5] J. Robins et al. Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models. 1999.
[6] J. Robins et al. Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers. 1992.
[7] J. Robins et al. Inference for imputation estimators. 2000.
[8] James M. Robins et al. Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models: Rejoinder. 1999.