Tree-based machine learning methods for survey research

Predictive modeling methods from machine learning have become popular tools across disciplines for exploring and analyzing diverse data. These methods often require no prior knowledge of the functional form of the relationship under study and can adapt to complex non-linear and non-additive relationships between the outcome and its predictors, while focusing specifically on prediction performance. Survey researchers are beginning to adopt this modeling perspective to adjust or improve various aspects of data collection and survey management. To facilitate this strand of research, this paper (1) introduces prominent tree-based machine learning methods, (2) reviews and discusses previous and potential future applications of tree-based supervised learning in survey research, and (3) illustrates the use of these techniques for modeling and predicting nonresponse in panel surveys.
