Mood Disorder Detection in Adolescents by Classification Trees, Random Forests and XGBoost in Presence of Missing Data

We apply tree-based classification algorithms, namely classification trees (fitted with the rpart algorithm), random forests, and XGBoost, to detect mood disorder in a group of 2508 lower secondary school students. The dataset presents several challenges, the most important of which are a large amount of missing data and heavy class imbalance (there are few severe mood disorder cases). We find that all algorithms are specific, but only the rpart algorithm is sensitive; i.e., it is able to detect real cases of mood disorder. We conclude that this is because the rpart algorithm uses surrogate variables to handle missing data. The most important social-studies-related result is that adolescents’ relationships with their parents are the single most important factor in developing mood disorders, far more important than other factors such as socio-economic status or school success.
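
As a rough illustration of the distinction between specific and sensitive classifiers on unbalanced data, the following minimal Python sketch computes both measures for a trivial classifier that never predicts the positive class. The labels are hypothetical toy values (95 negatives, 5 positives), not the study's data, and the helper name `sensitivity_specificity` is an assumption introduced here for illustration only.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    # Sensitivity: share of real cases that are detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of non-cases that are correctly left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical, heavily unbalanced toy labels (NOT the paper's data).
y_true = [0] * 95 + [1] * 5
# A classifier that always predicts "no disorder".
always_negative = [0] * 100

sens, spec = sensitivity_specificity(y_true, always_negative)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy setting the always-negative classifier is perfectly specific yet detects none of the real cases, which is why accuracy or specificity alone can be misleading when severe cases are rare.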
