Automated text analysis in psychology: methods, applications, and future developments

Recent years have seen rapid development of automated text analysis methods for measuring psychological and demographic properties. Although this work has been driven mainly by computer scientists and computational linguists, such methods can be of great value to social scientists in general and to psychologists in particular. In this paper, we review some of the most popular approaches to automated text analysis from the perspective of social scientists and give examples of their application in different theoretical domains. After describing some of the pros and cons of these methods, we speculate about future methodological developments and how they might change the social sciences. We conclude that, although current methods have many disadvantages and pitfalls compared to more traditional methods of data collection, the constant increase in computational power and the wide availability of textual data will inevitably make automated text analysis a common tool for psychologists.
