Improved Horvitz–Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology

The case-cohort study involves two-phase sampling: simple random sampling from an infinite superpopulation at phase one and stratified random sampling from a finite cohort at phase two. Standard analyses of case-cohort data involve solving inverse probability weighted (IPW) estimating equations, with weights determined by the known phase two sampling fractions. The variance of parameter estimates in (semi)parametric models, including the Cox model, is the sum of two terms: (i) the model-based variance of the usual estimates that would be calculated if full data were available for the entire cohort; and (ii) the design-based variance from IPW estimation of the unknown cohort total of the efficient influence function (IF) contributions. This second variance component may be reduced by adjusting the sampling weights, either by calibrating them to known cohort totals of auxiliary variables correlated with the IF contributions or by estimating them using these same auxiliary variables. Both adjustment methods are implemented in the R survey package. We derive the limit laws of coefficients estimated using adjusted weights. The asymptotic results suggest practical methods for constructing auxiliary variables, which are evaluated by simulation of case-cohort samples from the National Wilms Tumor Study and by log-linear modeling of case-cohort data from the Atherosclerosis Risk in Communities Study. Although not semiparametric efficient, estimators based on adjusted weights may come close to achieving full efficiency within the class of augmented IPW estimators.
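The effect of calibration on the design-based variance component can be illustrated with a small simulation. The following is a minimal sketch in Python under stated assumptions, not the R survey package's implementation and not the paper's own simulation: the cohort is synthetic, `y` merely stands in for an influence function contribution, `x` is a single auxiliary variable known for the whole cohort, and all function names (`ht_total`, `calibrate_linear`, `make_cohort`, `phase_two_sample`, `compare`) are hypothetical. Calibration here is the linear (regression-type) adjustment to the known cohort size and the known cohort total of `x`.

```python
import random


def ht_total(values, weights):
    """Horvitz-Thompson estimate of a cohort total: sum of weight * value."""
    return sum(w * v for v, w in zip(values, weights))


def calibrate_linear(weights, x, n_cohort, x_total):
    """Linear calibration to two known cohort totals: the count and sum of x.

    Returns w_i = d_i * (1 + lam1 + lam2 * x_i), with (lam1, lam2) chosen so
    that sum(w_i) == n_cohort and sum(w_i * x_i) == x_total.
    """
    m11 = sum(weights)
    m12 = sum(d * xi for d, xi in zip(weights, x))
    m22 = sum(d * xi * xi for d, xi in zip(weights, x))
    r1 = n_cohort - m11          # shortfall in the estimated cohort size
    r2 = x_total - m12           # shortfall in the estimated total of x
    det = m11 * m22 - m12 * m12  # determinant of the 2x2 system
    lam1 = (m22 * r1 - m12 * r2) / det
    lam2 = (m11 * r2 - m12 * r1) / det
    return [d * (1.0 + lam1 + lam2 * xi) for d, xi in zip(weights, x)]


def make_cohort(n, rng):
    """Synthetic cohort of (case, x, y) tuples; y is correlated with x."""
    cohort = []
    for _ in range(n):
        case = rng.random() < 0.1
        x = rng.gauss(1.0 if case else 0.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.5)
        cohort.append((case, x, y))
    return cohort


def phase_two_sample(cohort, control_fraction, rng):
    """Stratified phase two: keep all cases, subsample the non-cases."""
    cases = [u for u in cohort if u[0]]
    controls = [u for u in cohort if not u[0]]
    n_controls = max(1, round(control_fraction * len(controls)))
    drawn = rng.sample(controls, n_controls)
    # Design weights are inverse sampling fractions within each stratum.
    weights = [1.0] * len(cases) + [len(controls) / n_controls] * n_controls
    return cases + drawn, weights


def compare(n=2000, reps=200, seed=1):
    """Mean squared error of the estimated total of y, design vs calibrated."""
    rng = random.Random(seed)
    cohort = make_cohort(n, rng)
    x_total = sum(u[1] for u in cohort)
    y_total = sum(u[2] for u in cohort)
    sq_err_ht = sq_err_cal = 0.0
    for _ in range(reps):
        sample, d = phase_two_sample(cohort, 0.1, rng)
        xs = [u[1] for u in sample]
        ys = [u[2] for u in sample]
        w = calibrate_linear(d, xs, n, x_total)
        sq_err_ht += (ht_total(ys, d) - y_total) ** 2
        sq_err_cal += (ht_total(ys, w) - y_total) ** 2
    return sq_err_ht / reps, sq_err_cal / reps


if __name__ == "__main__":
    mse_ht, mse_cal = compare()
    print("MSE with design weights:    ", mse_ht)
    print("MSE with calibrated weights:", mse_cal)
```

Because `y` is constructed to be strongly correlated with `x`, the calibrated weights should estimate the cohort total of `y` with much smaller error across repeated phase two samples. The gain shrinks as the correlation between the auxiliary variable and the quantity being totaled weakens, which is why the choice of auxiliary variables matters.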
