Surveying the Forests and Sampling the Trees: An overview of Classification and Regression Trees and Random Forests with applications in Survey Research

Survey and social science researchers are well versed in traditional modeling approaches such as multiple regression and logistic regression, but more contemporary nonparametric techniques offer greater flexibility in model form and distributional assumptions. Classification and regression trees (CARTs) and random forests are two such methods. In survey research they are increasingly used to create nonresponse adjustments and to estimate propensity scores for responsive/adaptive survey designs. Both methods can be used for regression or classification tasks and offer researchers and practitioners strong alternatives to the more classical approaches. CARTs and random forests can be applied when typical distributional assumptions are unlikely to be satisfied, and they incorporate interactions automatically. CART models can be estimated in the presence of missing data. Random forests can adapt to the complexity of the dataset and can be estimated when the number of predictors is large relative to the sample size. This article provides an accessible description of both methods and illustrates their use by developing models that predict survey response from a collection of demographic variables known for both respondents and nonrespondents.
