Optimal Integrating Learning for Split Questionnaire Design Type Data

In the era of data science, it is common to encounter data in which different subsets of variables are observed for different cases. An example is the split questionnaire design (SQD), which is adopted to reduce respondent fatigue and improve response rates by assigning different subsets of the questionnaire to different sampled respondents. A general question then is how to estimate the regression function based on such blockwise observed data. Currently, this is often carried out with the aid of missing data methods, which may unfortunately suffer from intensive computational cost, high variability, and possibly large modeling biases in real applications. In this article, we develop a novel approach for estimating the regression function for SQD-type data. We first construct a list of candidate models, each fitted separately on one of the available data blocks, and then combine the resulting estimates to make efficient use of all the information. We show that the resulting averaged model is asymptotically optimal in the sense that its squared loss and risk are asymptotically equivalent to those of the best but infeasible averaged estimator. Both simulated examples and an application to the SQD dataset from the European Social Survey show the promise of the proposed method.
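The two-step idea of fitting candidate models on separate data blocks and then averaging them can be illustrated with a minimal sketch. It is not the weight-selection criterion of this article. It assumes simulated data, two blocks that share one core covariate, ordinary least squares candidates, and a small hypothetical validation set with all covariates observed, on which a single averaging weight is chosen by minimizing squared error. All names (`simulate`, `fit_ols`, `cols1`, `cols2`, `w`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    # Columns of X: a core item asked of everyone, then split items a and b.
    X = rng.normal(size=(n, 3))
    y = 1.0 + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.5, size=n)
    return X, y

def fit_ols(X, y):
    # Least squares fit with an intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def mse(y, f):
    return float(np.mean((y - f) ** 2))

# Block 1 respondents answer (core, a); block 2 respondents answer (core, b).
cols1, cols2 = [0, 1], [0, 2]
X1, y1 = simulate(200)
X2, y2 = simulate(200)

# Step 1: one candidate model per data block, using only that block's items.
beta1 = fit_ols(X1[:, cols1], y1)
beta2 = fit_ols(X2[:, cols2], y2)

# Assumption for this sketch: a small validation set with all items observed.
Xv, yv = simulate(100)
f1 = predict(beta1, Xv[:, cols1])
f2 = predict(beta2, Xv[:, cols2])

# Step 2: choose w in [0, 1] minimizing ||yv - (w*f1 + (1-w)*f2)||^2.
# The objective is a convex quadratic in w, so clipping the unconstrained
# minimizer to [0, 1] gives the constrained minimizer.
d = f1 - f2
w = float(np.clip(np.dot(yv - f2, d) / np.dot(d, d), 0.0, 1.0))
f_avg = w * f1 + (1.0 - w) * f2

print("weight on model 1:", w)
print("validation MSE (model 1, model 2, averaged):",
      mse(yv, f1), mse(yv, f2), mse(yv, f_avg))
```

With more than two blocks, the same idea would call for a weight vector constrained to the simplex and a quadratic programming solver in place of the closed-form clip. In a real SQD with no fully observed cases, the weight criterion would have to be built differently from this validation-set shortcut.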
