Studying User Income through Language, Behaviour and Affect in Social Media

Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear regression methods, namely Gaussian Processes, and achieve strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and to analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of users' posts. Our analysis uncovers correlations between different feature categories and income. Some reflect common belief, e.g. higher perceived education and intelligence indicate higher earnings; some reflect known differences, e.g. those related to gender and age; and others are novel, e.g. higher income users express more fear and anger, whereas lower income users more often express emotion and opinions.
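To make the modelling step concrete, the following is a minimal sketch of Gaussian Process regression evaluated by the Pearson correlation between predicted and actual values. It is not the authors' model: the data are synthetic, the two features and the target function are invented for illustration, the squared-exponential kernel is one common choice among many, and the kernel hyperparameters are fixed by hand rather than learned. All function names (`rbf_kernel`, `gp_fit_predict`, `run_demo`, and so on) are hypothetical.

```python
import math
import random


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return variance * math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def cholesky_solve(lower, b):
    """Solve (L L^T) x = b by forward, then backward, substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


def gp_fit_predict(train_x, train_y, test_x, noise=0.1):
    """Posterior mean of a GP with an RBF kernel and Gaussian observation noise."""
    mean_y = sum(train_y) / len(train_y)
    centred = [y - mean_y for y in train_y]  # constant mean function
    n = len(train_x)
    gram = [[rbf_kernel(train_x[i], train_x[j]) + (noise ** 2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = cholesky_solve(cholesky(gram), centred)
    return [mean_y + sum(rbf_kernel(x, t) * a for t, a in zip(train_x, alpha))
            for x in test_x]


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def make_data(n, rng):
    """Synthetic stand-in: two made-up user features and a non-linear target."""
    xs = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(n)]
    ys = [math.sin(x[0]) + 0.5 * x[1] ** 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def run_demo(seed=0):
    """Fit on 80 synthetic points, predict 30 held-out points, return Pearson r."""
    rng = random.Random(seed)
    train_x, train_y = make_data(80, rng)
    test_x, test_y = make_data(30, rng)
    predictions = gp_fit_predict(train_x, train_y, test_x)
    return pearson(predictions, test_y)


if __name__ == "__main__":
    print(f"Pearson r on held-out synthetic data: {run_demo():.3f}")
```

The sketch uses only the standard library so that each step of the posterior mean computation is visible. In practice, a library implementation that also optimises the kernel hyperparameters would be the more usual choice.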
