Early Detection of Signs of Anorexia and Depression Over Social Media using Effective Machine Learning Frameworks

The CLEF eRisk 2018 challenge focuses on early detection of signs of depression or anorexia from posts and comments on social media. The eRisk lab organized two tasks this year and released a separate corpus for each. Both corpora are built from posts and comments on Reddit, a popular social media platform. The machine learning group at Ramakrishna Mission Vivekananda Educational and Research Institute (RKMVERI), India, participated in this challenge and submitted five runs for each of the two tasks. This paper presents different machine learning techniques and analyzes their performance for early risk prediction of anorexia or depression. The techniques involve various classifiers and feature engineering schemes. A simple bag of words model has been used to train AdaBoost, random forest, logistic regression and support vector machine classifiers to identify documents related to anorexia or depression in the individual corpora. We have also extracted terms related to anorexia or depression using MetaMap, a tool for extracting biomedical concepts. The classifiers have therefore been implemented using bag of words features and MetaMap features individually, and subsequently using the combination of the two. The performance of a recurrent neural network is also reported using GloVe and fastText word embeddings. GloVe and fastText are pre-trained word vectors developed from specific corpora, e.g., Wikipedia. The experimental analysis on the training set shows that the AdaBoost classifier using the bag of words model outperforms the other methods for task 1, and it achieves the best precision on the test set among all the runs in the challenge. The support vector machine classifier using the bag of words model outperforms the other methods in terms of F-measure for task 2. The results on the test set submitted to the challenge suggest that these frameworks achieve reasonably good performance.
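The feature scheme described above (bag of words counts, concept features, and their combination) together with the precision and F-measure used for evaluation can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation: the tokenizer, the function names, and the `CONCEPT_TERMS` list are assumptions made for illustration, and the small term list merely stands in for the output of a real concept extractor such as MetaMap.

```python
import re
from collections import Counter

# Hypothetical stand-in for concept-extractor output; a real system would
# obtain these terms from a tool such as MetaMap, not a hard-coded list.
CONCEPT_TERMS = ["depressed", "insomnia", "hopeless"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed scheme)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Collect the sorted set of distinct tokens across all documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bag_of_words(text, vocabulary):
    """Return the term-count vector of the text over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def concept_features(text, concept_terms=CONCEPT_TERMS):
    """Return one binary indicator per concept term present in the text."""
    tokens = set(tokenize(text))
    return [1 if term in tokens else 0 for term in concept_terms]


def combined_features(text, vocabulary, concept_terms=CONCEPT_TERMS):
    """Concatenate bag of words counts with the concept indicators."""
    return bag_of_words(text, vocabulary) + concept_features(text, concept_terms)


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F-measure for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    docs = ["I feel hopeless and tired", "Great day at the park"]
    vocab = build_vocabulary(docs)
    for doc in docs:
        print(combined_features(doc, vocab))
    print(precision_recall_f1([1, 0, 1, 1], [1, 1, 0, 1]))
```

The combined vectors produced this way could be passed to any of the classifiers named in the abstract; the choice of binary indicators for the concept features is one simple option among several.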
