Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

We investigate how much topic modeling adds to text analysis for depression and for neuroticism, a personality measure strongly associated with depression. Using Pennebaker’s Linguistic Inquiry and Word Count (LIWC) lexicon to provide baseline features, we show that straightforward topic modeling with Latent Dirichlet Allocation (LDA) yields interpretable, psychologically relevant “themes” that improve prediction of clinical assessments.
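The feature setup described above (lexicon-based baseline features augmented with LDA topic proportions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny `TOY_LEXICON` is a hypothetical stand-in for LIWC and is not the real lexicon, the example texts are invented, and the collapsed Gibbs sampler is a generic way to fit LDA rather than the toolkit or settings used in this work.

```python
import random
from collections import Counter

# Hypothetical stand-in for a LIWC-style lexicon (category -> words).
# The real LIWC lexicon is far larger; these entries are illustrative only.
TOY_LEXICON = {
    "negemo": {"sad", "tired", "worried", "lonely"},
    "social": {"friends", "family", "roommate"},
}


def lexicon_features(tokens, lexicon=TOY_LEXICON):
    """Fraction of tokens in each category, in sorted category order."""
    n = max(len(tokens), 1)
    return [sum(t in words for t in tokens) / n
            for _, words in sorted(lexicon.items())]


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic proportions."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # token counts per topic
    assign = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom
                      for k in range(num_topics)])
    return theta


# Invented example texts, not data from the study.
docs = [
    "i feel sad and tired and worried about exams".split(),
    "my friends and family visited and we had dinner".split(),
    "exams make me worried and i cannot sleep".split(),
    "my roommate and friends went to the game".split(),
]
theta = lda_gibbs(docs, num_topics=2)
# Combined feature vector per document: lexicon features, then topic proportions.
features = [lexicon_features(doc) + t for doc, t in zip(docs, theta)]
```

Each row of `features` could then be passed to any standard regression or classification model. Comparing a model trained on the lexicon columns alone against one trained on the combined columns is one way to measure what the topic features contribute.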
