Exploratory Data Analysis using Random Forests∗

Although the rise of "big data" has made machine learning algorithms more visible and relevant for social scientists, they are still widely considered "black box" models suited only to prediction, not to substantive research. We argue that this need not be the case, and present one method, Random Forests, with an emphasis on its practical application for exploratory analysis and substantive interpretation. Random Forests detect interaction and nonlinearity without prespecification, have low generalization error in simulations and in many real-world problems, and can be used with many correlated predictors, even when there are more predictors than observations. Importantly, Random Forests can be interpreted in a substantively relevant way with variable importance measures, bivariate and multivariate partial dependence, proximity matrices, and methods for interaction detection. We provide intuition as well as technical detail about how Random Forests work, in theory and in practice, along with empirical examples from the literature on American and comparative politics. Furthermore, we provide software implementing the methods we discuss, in order to facilitate their use.

∗ Prepared for the 73rd annual MPSA conference, April 16-19, 2015.

† Zachary M. Jones is a Ph.D. student in political science at Pennsylvania State University (zmj@zmjones.com). Fridolin Linder is a Ph.D. student in political science at Pennsylvania State University (fridolin.linder@gmail.com); his work is supported by Pennsylvania State University and the National Science Foundation under IGERT award #DGE-1144860, "Big Data Social Science".
