Quantifying Feature Importance for Detecting Depression using Random Forest

Importance-based feature selection is a fundamental step in building machine learning models because it directs a model toward the variables that are most efficient and effective for the task. In this study, an explainable machine learning model based on random forest is built to identify the depression level of Twitter users. The model is made transparent by reporting the importance of its features. Several techniques exist for quantifying feature importance. In this study, random forest serves both as a classifier, one that outperforms many alternatives such as decision trees, and as a method for weighting the input features according to their importance. Feature importance is measured using several techniques, including random forest, and the results of these techniques are compared. Furthermore, the feature importance scores are used to weight the input variables within a complete system that recommends a solution for depressed persons. The experimental results confirm the superiority of random forest over other classifiers under three different methods for measuring feature importance. The accuracy of random forest classification reached 84.7%, and applying feature importance increased the classifier accuracy to 84.9%.

Keywords: machine learning; random forest; feature selection; feature importance; depression; emotions; Twitter
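As a rough illustration of using a random forest both as a classifier and as a source of feature weights, the sketch below uses scikit-learn on synthetic data. Everything in it is an assumption made for illustration: the data, the feature names, the hyperparameters, and the choice of two importance measures (impurity-based and permutation-based) are not taken from the study, which does not name its three methods here and may have used different tooling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-user features (hypothetical; not the study's data).
# With shuffle=False, the first 3 columns are informative and the last 3 are noise.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Random forest used as the classifier.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
baseline_accuracy = forest.score(X_test, y_test)

# Measure 1: impurity-based importance (mean decrease in Gini), normalized to sum to 1.
impurity_importance = forest.feature_importances_

# Measure 2: permutation importance, the mean drop in held-out accuracy
# when the values of a single column are shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
permutation_scores = perm.importances_mean

# Rank features by impurity importance, keep the top 3, and refit on that subset.
ranked = sorted(range(len(feature_names)),
                key=lambda i: impurity_importance[i], reverse=True)
top_k = ranked[:3]
reduced = RandomForestClassifier(n_estimators=200, random_state=0)
reduced.fit(X_train[:, top_k], y_train)
reduced_accuracy = reduced.score(X_test[:, top_k], y_test)

for i in ranked:
    print(f"{feature_names[i]}: impurity={impurity_importance[i]:.3f} "
          f"permutation={permutation_scores[i]:.3f}")
print(f"accuracy with all features:   {baseline_accuracy:.3f}")
print(f"accuracy with top-3 features: {reduced_accuracy:.3f}")
```

Comparing the two accuracy figures mirrors the kind of before-and-after comparison reported in the abstract, although on synthetic data the size and even the direction of the change carry no meaning for the depression task.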
