Methods for Stratified Cluster Sampling with Informative Stratification

We look at fitting regression models to data from stratified cluster samples in which the strata may depend on the responses observed within clusters. An important class of examples is family studies in genetic epidemiology, where the probability of selecting a family into the study depends on the incidence of disease within the family. We develop the survey-weighted estimating equation approach for this problem, with particular emphasis on the estimation of superpopulation parameters. Full maximum likelihood for this class of problems requires modelling the population distribution of the covariates, which is not feasible when there are many potential covariates. We therefore discuss efficient semiparametric maximum likelihood methods in which the covariate distribution is left completely unspecified, and we compare the relative efficiencies of the survey-weighted and semiparametric approaches.
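As a rough illustration of the survey-weighted estimating equation idea, the sketch below solves a weighted logistic score equation, sum_i w_i x_i (y_i - p_i(beta)) = 0, with inverse selection probabilities as weights. It is a minimal sketch under stated assumptions, not the method developed in the paper: the function names, the toy data, and the selection probabilities (all cases kept, 25% of controls kept) are hypothetical, the logistic model is chosen only for concreteness, and clustering is ignored, so only a point estimate is computed and no design-based variance.

```python
import numpy as np


def weighted_score(X, y, w, beta):
    """Weighted logistic score: sum_i w_i * x_i * (y_i - p_i(beta))."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (w * (y - p))


def weighted_logistic_fit(X, y, w, n_iter=50, tol=1e-10):
    """Solve the weighted score equation by Newton-Raphson (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - p))
        # Weighted information matrix: sum_i w_i p_i (1 - p_i) x_i x_i'
        info = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Hypothetical toy population (illustrative values only).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + 1.0 * x))))

# Response-dependent selection: assumed probabilities, known by design.
pi = np.where(y == 1, 1.0, 0.25)
keep = rng.random(n) < pi

# Weights are inverse selection probabilities for the sampled units.
Xs, ys, ws = X[keep], y[keep], 1.0 / pi[keep]
beta_hat = weighted_logistic_fit(Xs, ys, ws)
```

The weights undo the over-representation of cases in the sample, so the solution targets the same parameter as a fit to the full population. A semiparametric maximum likelihood fit would use the sampled data differently and is not shown here.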
