The Random Forests statistical technique: An examination of its value for the study of reading

ABSTRACT

Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this article, we discuss the method of Random Forests and demonstrate its suitability for addressing the statistical concerns raised by such data sets. The method is contrasted with other methods of estimating relative variable importance, especially Dominance Analysis and Multimodel Inference. All methods were applied to a data set containing measures of eye movements during reading and of offline comprehension, together with multiple ability measures that are highly collinear because of their shared verbal core. We demonstrate that the Random Forests method surpasses the other methods in handling model overfitting and accounts for a comparable or larger amount of variance in reading measures.
