Efficient estimation of regression parameters from multistage studies with validation of outcome and covariates

Abstract Often the variables in a regression model are difficult or expensive to obtain, so auxiliary variables are collected in a preliminary step of a study and the model variables are measured at later stages on only a subsample of the study participants, called the validation sample. We consider a study in which, at the first stage, some variables, throughout called auxiliaries, are collected; at the second stage, the true outcome is measured on a subsample of the first-stage sample; and at the third stage, the true covariates are collected on a subset of the second-stage sample. To increase efficiency, the probabilities of selection into the second- and third-stage samples are allowed to depend on the data observed at the previous stages. In this paper we describe a class of inverse-probability-of-selection-weighted semiparametric estimators for the parameters of the model for the conditional mean of the outcome given the covariates. We assume that a subject's probability of being sampled at each subsequent stage is bounded away from zero and depends only on the subject's data collected at the previous sampling stages. We show that the asymptotic variance of the optimal estimator in our class equals the semiparametric variance bound for the model. Because the optimal estimator depends on unknown population parameters, it is not available for data analysis. We therefore propose an adaptive estimation procedure for locally efficient inference. A simulation study examines the finite-sample properties of the proposed estimators.
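The weighting idea behind this class of estimators can be illustrated with a minimal sketch. It is not the optimal or adaptive estimator of the paper; it is only the simplest weighted estimating equation, in which each fully observed subject is weighted by the inverse of the product of its known stage-two and stage-three selection probabilities. Everything specific here is an assumption made for illustration: the linear mean model, the single surrogate outcome `a_aux`, the selection probabilities, and the helper names `ipw_linear_fit` and `simulate_three_stage`.

```python
import numpy as np


def ipw_linear_fit(x, y, weights):
    """Solve sum_i w_i * (1, x_i)' * (y_i - b0 - b1 * x_i) = 0 for (b0, b1)."""
    design = np.column_stack([np.ones_like(x), x])
    weighted = design * weights[:, None]
    return np.linalg.solve(weighted.T @ design, weighted.T @ y)


def simulate_three_stage(n=20000, seed=0):
    """Hypothetical three-stage design; all numeric choices are illustrative."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                   # true covariate (stage 3)
    y = 1.0 + 2.0 * x + rng.normal(size=n)   # true outcome (stage 2)
    a_aux = y + rng.normal(size=n)           # surrogate outcome (stage 1)

    # Stage 2: selection depends only on the stage-1 auxiliary.
    pi2 = np.where(a_aux > 1.0, 0.7, 0.3)
    r2 = rng.random(n) < pi2

    # Stage 3: selection depends on data seen so far (here, the true outcome).
    pi3 = np.where(y > 1.0, 0.8, 0.4)
    r3 = r2 & (rng.random(n) < pi3)

    # Probability of being fully observed, bounded away from zero (>= 0.12).
    return x, y, r3, pi2 * pi3


x, y, complete, pi = simulate_three_stage()

# Inverse-probability-of-selection-weighted fit on the complete cases.
beta_ipw = ipw_linear_fit(x[complete], y[complete], 1.0 / pi[complete])

# Unweighted complete-case fit, shown only for comparison.
beta_naive = ipw_linear_fit(x[complete], y[complete], np.ones(complete.sum()))

print("IPW estimate   (intercept, slope):", beta_ipw)
print("Naive estimate (intercept, slope):", beta_naive)
```

In this simulated design the true coefficients are (1, 2). Because selection into the third stage depends on the outcome, the unweighted complete-case fit need not recover them, whereas the weighted estimating equation is unbiased when the selection probabilities are known.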
