Big Data meets Causal Survey Research: Understanding Nonresponse in the Recruitment of a Mixed-mode Online Panel

Survey scientists increasingly face the problem of high dimensionality in their research, as digitization makes it much easier to construct high-dimensional (or "big") data sets through tools such as online surveys and mobile applications. Machine learning methods can handle such data and have been applied successfully to predictive problems. However, in many situations survey statisticians want to learn about causal relationships, both to draw conclusions and to transfer the findings of one survey to another. Standard machine learning methods provide biased estimates of such relationships. We introduce the double machine learning approach into survey statistics; it gives approximately unbiased estimators of causal parameters. We show how it can be used to analyze survey nonresponse in a high-dimensional panel setting.

Affiliations: GESIS Leibniz Institute for the Social Sciences; University of Hamburg. Correspondence: Barbara Felderer, GESIS Leibniz Institute for the Social Sciences, B2,1, 68159 Mannheim. Email: barbara.felderer@gesis.org. arXiv:2102.08994v1 [stat.ME], 17 Feb 2021.
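The idea behind double machine learning can be illustrated with a minimal sketch. It assumes a partially linear model, y = theta * d + g(X) + noise, with the variable of interest d itself depending on the controls X. The sketch uses the partialling-out form with cross-fitting: both y and d are predicted from X on held-out folds, and theta is then estimated from the two sets of residuals. The simulated data, the ridge learner, and all function and variable names are assumptions made for illustration; this is not the authors' implementation, and any other learner could take the place of ridge.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, penalty=1.0):
    """Fit ridge regression with an intercept on the training fold, predict on the test fold."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    n_features = Xc.shape[1]
    coef = np.linalg.solve(
        Xc.T @ Xc + penalty * np.eye(n_features), Xc.T @ (y_train - y_mean)
    )
    return y_mean + (X_test - x_mean) @ coef


def double_ml_plr(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(X) + noise."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance predictions come from models that never saw the test fold.
        y_res[test_idx] = y[test_idx] - ridge_fit_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - ridge_fit_predict(X[train], d[train], X[test_idx])
    # Regress outcome residuals on treatment residuals.
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Plug-in standard error based on the orthogonal score.
    score = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(score ** 2) / np.mean(d_res ** 2) ** 2 / n)
    return theta, se


def simulate(n=2000, p=50, theta=0.5, seed=1):
    """Hypothetical data: the controls X drive both d and y (confounding)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = 1.0 / np.arange(1, p + 1)
    d = X @ beta + rng.normal(size=n)
    y = theta * d + X @ beta + rng.normal(size=n)
    return y, d, X


y, d, X = simulate()
theta_hat, se = double_ml_plr(y, d, X)
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```

In this simulation the true value of theta is 0.5, so the printed estimate can be compared against it. The cross-fitting step is what separates this from simply plugging a machine learning fit into a regression: each residual is computed from a model trained on the other folds.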
