Using County Demographics to Infer Attributes of Twitter Users

Social media are increasingly being used to complement traditional survey methods in health, politics, and marketing. However, little has been done to adjust for the sampling bias inherent in this approach. Inferring demographic attributes of social media users is thus a critical step to improving the validity of such studies. While there have been a number of supervised machine learning approaches to this problem, these rely on a training set of users annotated with attributes, which can be difficult to obtain. We instead propose training a demographic attribute classifiers that uses county-level supervision. By pairing geolocated social media with county demographics, we build a regression model mapping text to demographics. We then adopt this model to make predictions at the user level. Our experiments using Twitter data show that this approach is surprisingly competitive with a fully supervised approach, estimating the race of a user with 80% accuracy.

[1]  Leland Gerson Neuberg,et al.  A solution to the ecological inference problem: Reconstructing individual behavior from aggregate data , 1999 .

[2]  Shlomo Argamon,et al.  Effects of Age and Gender on Blogging , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[3]  Andrew Gelman,et al.  Struggles with survey weighting and regression modeling , 2007, 0710.5005.

[4]  George Forman,et al.  Quantifying counts and costs via classification , 2008, Data Mining and Knowledge Discovery.

[5]  Alexander J. Smola,et al.  Estimating labels from label proportions , 2008, ICML '08.

[6]  Brendan T. O'Connor,et al.  From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series , 2010, ICWSM.

[7]  David Yarowsky,et al.  Classifying latent user attributes in twitter , 2010, SMUC '10.

[8]  Lars Backstrom,et al.  ePluribus: Ethnicity on Social Networks , 2010, ICWSM.

[9]  Gideon S. Mann,et al.  Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data , 2010, J. Mach. Learn. Res..

[10]  Ben Taskar,et al.  Posterior Regularization for Structured Latent Variable Models , 2010, J. Mach. Learn. Res..

[11]  Ana-Maria Popescu,et al.  A Machine Learning Approach to Twitter User Classification , 2011, ICWSM.

[12]  David Yarowsky,et al.  Hierarchical Bayesian Models for Latent Attribute Detection in Social Media , 2011, ICWSM.

[13]  Sune Lehmann,et al.  Understanding the Demographics of Twitter Users , 2011, ICWSM.

[14]  Carolyn Penstein Rosé,et al.  Author Age Prediction from Text using Linear Regression , 2011, LaTeCH@ACL.

[15]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[16]  Daniel Gayo-Avello Don't turn social media into another 'Literary Digest' poll , 2011, Commun. ACM.

[17]  John D. Burger,et al.  Discriminating Gender on Twitter , 2011, EMNLP.

[18]  Eric P. Xing,et al.  Discovering Sociolinguistic Associations with Structured Sparsity , 2011, ACL.

[19]  Sara Rosenthal,et al.  Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations , 2011, ACL.

[20]  Wendy Liu,et al.  Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors , 2012, ICWSM.

[21]  Mark Dredze,et al.  How Social Media Will Change Public Health , 2012, IEEE Intelligent Systems.

[22]  Sharad Goel,et al.  Who Does What on the Web: A Large-Scale Study of Browsing Behavior , 2012, ICWSM.

[23]  Megha Agrawal,et al.  Characterizing Geographic Variation in Well-Being Using Tweets , 2013, ICWSM.

[24]  Margaret L. Kern,et al.  Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach , 2013, PloS one.

[25]  D. Ruths,et al.  What's in a Name? Using First Names as Features for Gender Inference in Twitter , 2013, AAAI Spring Symposium: Analyzing Microtext.

[26]  Aron Culotta,et al.  Estimating county health statistics with twitter , 2014, CHI.

[27]  Lakshman Krishnamurthi,et al.  Investigating the Relationship Between the Content of Online Word of Mouth, Advertising, and Brand Performance , 2014, Mark. Sci..