Finding structure in data using multivariate tree boosting

Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles such as random forests (Strobl, Malley, & Tutz, 2009) are a useful tool for finding structure, but they are difficult to interpret when there are multiple outcome variables, which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called gradient boosted regression trees (Friedman, 2001). Our extension, multivariate tree boosting, is a method for nonparametric regression that is useful for identifying important predictors, for detecting predictors with nonlinear effects and interactions without specifying such effects in advance, and for identifying predictors that cause two or more outcome variables to covary. We provide the R package "mvtboost" to estimate, tune, and interpret the resulting model; it extends the implementation of univariate boosting in the R package "gbm" (Ridgeway, 2015) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff & Keyes, 1995). Simulations verify that our approach identifies predictors with nonlinear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions.
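To make the idea of boosting trees for several continuous outcomes concrete, the following is a minimal sketch in plain Python. It is not the authors' algorithm and does not reproduce the "mvtboost" or "gbm" interfaces. It assumes the simplest possible setup: squared-error gradient boosting with one-split trees (stumps), fit separately to each outcome, with each predictor's summed error reduction reported as a relative influence. The function names (`fit_stump`, `boost_multivariate`, `predict`), the tuning values, and the simulated data are all hypothetical and chosen only for illustration.

```python
import random

def fit_stump(x_cols, r):
    """Return (gain, feature, threshold, left_mean, right_mean) for the single
    split that most reduces the squared error of the residuals r, or None."""
    n, total = len(r), sum(r)
    best = None
    for j, col in enumerate(x_cols):
        order = sorted(range(n), key=lambda i: col[i])
        left_sum = 0.0
        for k in range(n - 1):
            left_sum += r[order[k]]
            if col[order[k]] == col[order[k + 1]]:
                continue  # cannot split between tied values
            nl, nr = k + 1, n - (k + 1)
            right_sum = total - left_sum
            # Reduction in squared error obtained by this split.
            gain = left_sum ** 2 / nl + right_sum ** 2 / nr - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, col[order[k]], left_sum / nl, right_sum / nr)
    return best

def boost_multivariate(X, Y, n_trees=100, shrinkage=0.1):
    """Boost stumps on each outcome column of Y separately (an assumption of
    this sketch). Returns the fitted models and, per outcome, each predictor's
    relative influence as a percentage of the total error reduction."""
    n, p, q = len(X), len(X[0]), len(Y[0])
    x_cols = [[row[j] for row in X] for j in range(p)]
    models, influence = [], []
    for k in range(q):
        y = [row[k] for row in Y]
        f0 = sum(y) / n                      # start from the outcome mean
        pred = [f0] * n
        stumps, infl = [], [0.0] * p
        for _ in range(n_trees):
            resid = [y[i] - pred[i] for i in range(n)]  # negative gradient of squared error
            best = fit_stump(x_cols, resid)
            if best is None:
                break
            gain, j, t, lm, rm = best
            infl[j] += gain                  # credit the predictor that was split on
            stumps.append((j, t, shrinkage * lm, shrinkage * rm))
            for i in range(n):
                pred[i] += shrinkage * (lm if x_cols[j][i] <= t else rm)
        total = sum(infl) or 1.0
        influence.append([100.0 * v / total for v in infl])
        models.append((f0, stumps))
    return models, influence

def predict(models, row):
    """Predict every outcome for one row of predictors."""
    return [f0 + sum(l if row[j] <= t else r for j, t, l, r in stumps)
            for f0, stumps in models]

# Simulated data (illustrative only): x0 has a nonlinear effect on outcome 0,
# x1 affects both outcomes, and x2 is pure noise.
random.seed(1)
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
Y = [[x[0] ** 2 + x[1] + random.gauss(0, 0.05),
      2 * x[1] + random.gauss(0, 0.05)] for x in X]

models, influence = boost_multivariate(X, Y)
for k, infl in enumerate(influence):
    print(f"outcome {k}: relative influence =", [round(v, 1) for v in infl])
```

In this toy setup, a predictor that carries influence for more than one outcome (here x1) is a candidate for explaining why those outcomes covary. The sketch only tabulates influence per outcome and does not attempt the covariance decomposition described in the abstract.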

[1]  David V. Budescu,et al.  An Extension of Dominance Analysis to Canonical Correlation Analysis , 2009 .

[2]  G. Hooker,et al.  Ensemble Trees and CLTs: Statistical Inference for Supervised Learning , 2014 .

[3]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[4]  R. Tibshirani,et al.  Generalized Additive Models , 1986 .

[5]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[6]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[7]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[8]  Balázs Kégl,et al.  MULTIBOOST: A Multi-purpose Boosting Package , 2012, J. Mach. Learn. Res..

[9]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[10]  S. C. Johnson Hierarchical clustering schemes , 1967, Psychometrika.

[11]  G Tutz,et al.  Regularization for Generalized Additive Mixed Models by Likelihood-based Boosting , 2012, Methods of Information in Medicine.

[12]  David R. Anderson,et al.  Model selection and multimodel inference : a practical information-theoretic approach , 2003 .

[13]  G. W. Milligan,et al.  A study of standardization of variables in cluster analysis , 1988 .

[14]  Cindy S. Bergeman,et al.  What contributes to perceived stress in later life? A recursive partitioning approach. , 2011, Psychology and aging.

[15]  B. Thompson Canonical Correlation Analysis , 1984 .

[16]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[17]  Boris Chidlovskii,et al.  Boosting Multi-Task Weak Learners with Applications to Textual and Social Data , 2010, 2010 Ninth International Conference on Machine Learning and Applications.

[18]  Anilú Franco Arcega Multivariate Decision Trees Using Different Splitting Attribute Subsets for Large Datasets , 2010 .

[19]  Torsten Hothorn,et al.  Model-Based Boosting , 2015 .

[20]  K. Hornik,et al.  Model-Based Recursive Partitioning , 2008 .

[21]  Zina M. Ibrahim,et al.  Advances in Artificial Intelligence , 2003, Lecture Notes in Computer Science.

[22]  Inger P. Davis,et al.  Factors associated with caregiver stability in permanent placements: a classification tree approach. , 2011, Child abuse & neglect.

[23]  Ronald L. Rivest,et al.  Constructing Optimal Binary Decision Trees is NP-Complete , 1976, Inf. Process. Lett..

[24]  Michael I. Jordan,et al.  Multi-task feature selection , 2006 .

[25]  U. Grömping Variable importance in regression models , 2015 .

[26]  Achim Zeileis,et al.  Bias in random forest variable importance measures: Illustrations, sources and a solution , 2007, BMC Bioinformatics.

[27]  S. Maxwell,et al.  Cumulative and compensatory effects of competence and incompetence on depressive symptoms in children. , 1997, Journal of abnormal psychology.

[28]  G. Tutz,et al.  An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. , 2009, Psychological methods.

[29]  Ya Zhang,et al.  Multi-task learning for boosting with application to web search ranking , 2010, KDD.

[30]  W. Wong,et al.  The calculation of posterior distributions by data augmentation , 1987 .

[31]  Ulman Lindenberger,et al.  Structural equation model trees. , 2013, Psychological methods.

[32]  C S Bergeman,et al.  Trait Stress Resistance and Dynamic Stress Dissipation on Health and Well-Being: The Reservoir Model , 2014, Research in human development.

[33]  Denis Larocque,et al.  Multivariate trees for mixed outcomes , 2009, Comput. Stat. Data Anal..

[34]  R. R. Hocking,et al.  Selection of the Best Subset in Regression Analysis , 1967 .

[35]  J. Friedman Stochastic gradient boosting , 2002 .

[36]  J. Friedman Additive logistic regression: A statistical view of boosting , 2000 .

[37]  Kim F. Nimon,et al.  Interpreting Multiple Linear Regression: A Guidebook of Variable Importance , 2012 .

[38]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[39]  Gerhard Tutz,et al.  Variable Selection for Generalized Additive Mixed Models by Likelihood-based Boosting , 2011 .

[40]  J. Block,et al.  IQ and ego-resiliency: conceptual and empirical connections and separateness. , 1996, Journal of personality and social psychology.

[41]  Torsten Hothorn,et al.  Model-based Boosting 2.0 , 2010, J. Mach. Learn. Res..

[42]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[43]  Kim F. Nimon,et al.  Revisiting Interpretation of Canonical Correlation Analysis: A Tutorial and Demonstration of Canonical Commonality Analysis , 2010, Multivariate behavioral research.

[44]  Irvin Sam Schonfeld,et al.  Center for Epidemiologic Studies Depression Scale , 2020, Definitions.

[45]  L. Manovich,et al.  Trending: The Promises and the Challenges of Big Social Data , 2012 .

[46]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[47]  R. Ursano,et al.  The Impact of a Military Air Disaster on The Health of Assistance Workers: A Prospective Study , 1989, The Journal of nervous and mental disease.

[48]  G. Ridgeway The State of Boosting , 1999 .

[49]  W. Shadish,et al.  Experimental and Quasi-Experimental Designs for Generalized Causal Inference , 2001 .

[50]  Emil Pitkin,et al.  Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation , 2013, 1309.6392.

[51]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .

[52]  Veda C. Storey,et al.  Business Intelligence and Analytics: From Big Data to Big Impact , 2012, MIS Q..

[53]  Alan Y. Chiang,et al.  Generalized Additive Models: An Introduction With R , 2007, Technometrics.

[54]  J. Pallant,et al.  Development and Validation of a Scale to Measure Perceived Control of Internal States , 2000, Journal of personality assessment.

[55]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[56]  S. Wood Generalized Additive Models: An Introduction with R , 2006 .

[57]  Glenn De'ath,et al.  MULTIVARIATE REGRESSION TREES: A NEW TECHNIQUE FOR MODELING SPECIES-ENVIRONMENT RELATIONSHIPS , 2002 .

[58]  R. Carleton,et al.  Center for Epidemiologic Studies: Depression Scale , 2020, Encyclopedia of Personality and Individual Differences.

[59]  Melissa S. Yale,et al.  Differential Item Functioning , 2014 .

[60]  Xiaogang Wang,et al.  Boosted multi-task learning for face verification with applications to web image and video search , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Martin Guha,et al.  Encyclopedia of Statistics in Behavioral Science , 2006 .

[62]  T. Kamarck,et al.  A global measure of perceived stress. , 1983, Journal of health and social behavior.

[63]  H. Wainer,et al.  Differential Item Functioning. , 1994 .

[64]  Conor V. Dolan,et al.  TATES: Efficient Multivariate Genotype-Phenotype Analysis for Genome-Wide Association Studies , 2013, PLoS genetics.

[65]  M. Segal Tree-Structured Methods for Longitudinal Data , 1992 .

[66]  Winston A Hide,et al.  Big data: The future of biocuration , 2008, Nature.

[67]  J Elith,et al.  A working guide to boosted regression trees. , 2008, The Journal of animal ecology.

[68]  Benjamin Hofner,et al.  Model-based boosting in R: a hands-on tutorial using the R package mboost , 2012, Computational Statistics.

[69]  P. Bühlmann,et al.  Boosting With the L2 Loss , 2003 .

[70]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[71]  D. Russell,et al.  The revised UCLA Loneliness Scale: concurrent and discriminant validity evidence. , 1980, Journal of personality and social psychology.

[72]  B. Yu,et al.  Boosting with the L_2-Loss: Regression and Classification , 2001 .

[73]  C. Keyes,et al.  The structure of psychological well-being revisited. , 1995, Journal of personality and social psychology.

[74]  David W. Reid,et al.  The desired control measure and adjustment among the elderly , 1981 .

[75]  Yu-Shan Shih,et al.  Splitting variable selection for multivariate regression trees , 2007 .

[76]  K. Heller,et al.  Measures of perceived social support from friends and from family: Three validation studies , 1983, American journal of community psychology.

[77]  Donald E. Brown,et al.  Classification trees with optimal multivariate decision nodes , 1996, Pattern Recognit. Lett..

[78]  Joseph P. Romano,et al.  Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions , 1994 .

[79]  Jeffrey S. Simonoff,et al.  RE-EM trees: a data mining approach for longitudinal and clustered data , 2011, Machine Learning.

[80]  R. Haase,et al.  Multivariate analysis of variance. , 1987 .

[81]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[82]  Saso Dzeroski,et al.  Constraint Based Induction of Multi-objective Regression Trees , 2005, KDID.

[83]  C. Waternaux,et al.  Classification trees distinguish suicide attempters in major psychiatric disorders: a model of clinical decision making. , 2008, The Journal of clinical psychiatry.

[84]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[85]  Carla E. Brodley,et al.  Multivariate decision trees , 2004, Machine Learning.

[86]  K. Hornik,et al.  Unbiased Recursive Partitioning: A Conditional Inference Framework , 2006 .

[87]  G. De’ath MULTIVARIATE REGRESSION TREES: A NEW TECHNIQUE FOR MODELING SPECIES–ENVIRONMENT RELATIONSHIPS , 2002 .

[88]  David V Budescu,et al.  Erratum to “An Extension of Dominance Analysis to Canonical Correlation Analysis” , 2009, Multivariate behavioral research.

[89]  W. Loh,et al.  Regression trees for longitudinal and multiresponse data , 2012, 1209.4690.

[90]  Mark R. Segal,et al.  Multivariate random forests , 2011, WIREs Data Mining Knowl. Discov..

[91]  Stacey B. Scott,et al.  Combinations of stressors in midlife: examining role and domain stressors using regression trees and random forests. , 2013, The journals of gerontology. Series B, Psychological sciences and social sciences.

[92]  Jerome H Friedman,et al.  Multiple additive regression trees with application in epidemiology , 2003, Statistics in medicine.

[93]  Gerhard Tutz,et al.  Boosting techniques for nonlinear time series models , 2012 .

[94]  R. M. Durand,et al.  Redundancy analysis: An alternative to canonical correlation and multivariate multiple regression in exploring interset associations. , 1988 .

[95]  Peter Buhlmann,et al.  BOOSTING ALGORITHMS: REGULARIZATION, PREDICTION AND MODEL FITTING , 2007, 0804.2752.

[96]  Achim Zeileis,et al.  Conditional variable importance for random forests , 2008, BMC Bioinformatics.

[97]  L. Breslow,et al.  Measurement of physical health in a general population survey. , 1971, American journal of epidemiology.

[98]  B. Peter,et al.  BOOSTING FOR HIGH-MULTIVARIATE RESPONSES IN HIGH-DIMENSIONAL LINEAR REGRESSION , 2006 .

[99]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[100]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[101]  Bruce Thompson,et al.  Stepwise Regression and Stepwise Discriminant Analysis Need Not Apply here: A Guidelines Editorial , 1995 .

[102]  K. Hornik,et al.  A Laboratory for Recursive Partytioning , 2015 .

[103]  Gerhard Tutz,et al.  Boosting nonlinear additive autoregressive time series , 2009, Comput. Stat. Data Anal..

[104]  Xiaogang Wang,et al.  Boosted multi-task learning for face verification with applications to web image and video search , 2009, CVPR.

[105]  Manuel A. R. Ferreira,et al.  A multivariate test of association , 2009 .

[106]  Letitia Anne Peplau,et al.  The Revised UCLA Loneliness Scale: Concurrent and Discriminant Validity Evidence , 1980 .