Semiparametric methods for response‐selective and missing data problems in regression

Suppose that data are generated according to the model f(y|x; θ) g(x), where y is a response and x is a vector of covariates. We derive and compare semiparametric likelihood and pseudolikelihood methods for estimating θ in situations where the generated units are not all fully observed and where it is impossible or undesirable to model the covariate distribution g(x). The probability that a unit is fully observed may depend on y, and there may be a subset of covariates that is observed only for a subsample of individuals. Our key assumptions are that the probability that a unit has missing data depends only on which of a finite number of strata (y, x) belongs to, and that stratum membership is observed for every unit. Applications include case–control studies in epidemiology, field reliability studies, and broad classes of missing data and measurement error problems. Our results make fully efficient estimation of θ feasible, and they generalize and provide insight into a variety of methods that have been proposed for specific problems.
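
The setting above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the estimators derived in the paper: it assumes a logistic model for f(y|x; θ), two strata defined by y alone, and known selection probabilities, and it uses a simple inverse-probability-weighted pseudolikelihood. The function name `fit_weighted_logistic`, the parameter values, and the selection probabilities are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=25):
    """Solve sum_i w_i * x_i * (y_i - expit(x_i' theta)) = 0 by Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))            # fitted P(y = 1 | x)
        score = X.T @ (w * (y - mu))                     # weighted score vector
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X  # weighted information
        theta = theta + np.linalg.solve(info, score)
    return theta

rng = np.random.default_rng(0)

# Assumed population: logistic f(y | x; theta), standard normal g(x).
N = 200_000
theta_true = np.array([-2.0, 1.0])
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

# Assumed design: strata are defined by y, and stratum membership is known
# for every unit; x is kept for all cases but only 20% of non-cases.
p_select = np.where(y == 1, 1.0, 0.2)
R = rng.random(N) < p_select   # R[i] is True when unit i is fully observed

# Naive fit ignores the selection; weighted fit uses weights 1 / p_select.
theta_naive = fit_weighted_logistic(X[R], y[R], np.ones(R.sum()))
theta_weighted = fit_weighted_logistic(X[R], y[R], 1.0 / p_select[R])

print("true     :", theta_true)
print("naive    :", theta_naive)
print("weighted :", theta_weighted)
```

In this sketch the naive fit has a distorted intercept because non-cases are under-represented among the fully observed units, while the weighted fit recovers θ without any model for g(x). Weighting of this kind is generally not fully efficient, which is the gap that semiparametric likelihood methods aim to close.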
