Gaining insights from social media language: Methodologies and challenges.

Language data available through social media provide opportunities to study people at an unprecedented scale. However, little guidance is available to psychologists who want to enter this area of research. Drawing on tools and techniques developed in natural language processing, we first introduce psychologists to social media language research, identifying descriptive and predictive analyses that language data allow. Second, we describe how raw language data can be accessed and quantified for inclusion in subsequent analyses, exploring personality as expressed on Facebook to illustrate. Third, we highlight challenges and issues to be considered, including accessing and processing the data, interpreting effects, and ethical issues. Social media has become a valuable part of social life, and there is much we can learn by bringing together the tools of computer science with the theories and insights of psychology.
