Accounting for Complex Sample Designs in Multiple Imputation Using the Finite Population Bayesian Bootstrap

Existing fully parametric multiple imputation (MI) techniques typically assume a simple random sample design and rely on design-based estimators at the analysis stage to account for design effects. More complex methods that build the sample design into the imputation model typically require strong model assumptions and expensive modeling and computation, and may still fail to be fully “congenial”. This dissertation develops a multiple imputation framework that accounts for complex sample designs through synthetic data generation and simple parametric models for imputing missing values. We term this approach “two-step MI” because we first generate posterior predictive distributions of the population, including the item-level missing data, and then fill in the item-level missing data using standard parametric MI techniques.

The first paper outlines the conceptual framework and develops a modified version of the standard MI combining rules for inference under the new method. The focus is on the role of survey weights, used in conjunction with MI, in adjusting for item nonresponse. The new procedure uses a weighted finite population Bayesian bootstrap to generate posterior predictive distributions of the finite population that are free of complex design features. As a result, analysts need only apply simple unweighted estimation methods to the imputed datasets and, depending on the missingness mechanism, can ignore the sample design in the imputation procedure. Our simulation under a probability-proportional-to-size (PPS) sampling design shows that the proposed method achieves good frequentist properties, in contrast to many standard alternatives that are not robust to model misspecification. The second paper extends two-step MI to two-stage cluster sample designs and develops two variations of the proposed procedure that simultaneously resolve clustering effects and “inverse” the survey weights. Their performance is evaluated under a variety of simulation conditions in comparison with existing MI techniques. The third paper develops a general methodology to account for stratification effects in a highly stratified design. Quantile estimation and rare-event binary data are investigated as well. Extensive analyses are conducted for real data applications: the Behavioral Risk Factor Surveillance System (BRFSS), the Delta-V measure from NASS-CDS crash record data, and Body Mass Index data from NHANES III.
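To make the first step more concrete, the following is a minimal sketch of how a weighted finite population Bayesian bootstrap might "undo" survey weights for a single synthetic population. The abstract does not state the selection rule, so the sketch assumes the weighted Pólya urn form commonly given for this procedure: with sample size n, weights w_i summing to the population size N, and l_i the number of times unit i has already been drawn, the k-th of N - n draws selects unit i with probability proportional to w_i - 1 + l_i (N - n)/n. The function name `weighted_fpbb` and its arguments are illustrative, not taken from the dissertation, and the sketch omits the surrounding steps (repeated draws, imputation, and the modified combining rules).

```python
import numpy as np


def weighted_fpbb(weights, rng):
    """Draw one synthetic population (as indices into the sample).

    Assumption: weighted Polya urn selection probabilities, as described
    in the lead-in; this is an illustrative sketch, not the author's code.
    """
    w = np.asarray(weights, dtype=float)
    n = w.size
    N = int(round(w.sum()))  # weights are assumed to sum to the population size
    if N <= n or np.any(w < 1.0):
        raise ValueError("need weights >= 1 that sum to more than the sample size")

    counts = np.zeros(n)      # l_i: times unit i has been drawn so far
    step = (N - n) / n        # urn increment applied after each selection
    for _ in range(N - n):
        numer = w - 1.0 + counts * step
        probs = numer / numer.sum()
        i = rng.choice(n, p=probs)
        counts[i] += 1

    # Synthetic population = the n sampled units plus the N - n urn draws.
    sampled = np.arange(n)
    drawn = np.repeat(sampled, counts.astype(int))
    return np.concatenate([sampled, drawn])


# Usage: three sampled units whose weights sum to a population of 10.
rng = np.random.default_rng(0)
population = weighted_fpbb([1.5, 2.5, 6.0], rng)
```

Because each synthetic population is built to resemble a simple random draw from the finite population, unweighted estimators and standard parametric imputation models can then be applied to it, which is the property the abstract relies on.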

[107]  Padraic Murphy,et al.  An Overview of Primary Sampling Units (PSUs) in Multi-Stage Samples for Demographic Surveys 1 , 2008 .