Lifetime Lexical Variation in Social Media

The rapid growth of online social media has attracted a large number of Internet users, and the large volume of content they generate offers an opportunity to study how lexical usage varies with age. In this paper, we present a latent variable model that jointly models the lexical content of tweets and the ages of Twitter users. Our model assumes that each topic has not only a word distribution but also an age distribution. We propose a Gibbs-EM algorithm to perform inference on our model. Empirical evaluation shows that our model learns meaningful age-specific topics, such as "school" for teenagers and "health" for older people. Our model can also be used for age prediction, where it outperforms a number of baseline methods.
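
To illustrate the idea that a topic carries both a word distribution and an age distribution, the following is a minimal sketch and not the model or the Gibbs-EM inference procedure of this paper. It assumes two hand-written toy topics, a uniform topic prior, and an age distribution summarised only by its mean; the names `TOPICS`, `topic_posterior`, and `expected_age`, and all probabilities and ages, are hypothetical and chosen for illustration.

```python
import math

# Hypothetical toy topics (not learned from data). Each topic has a word
# distribution and an age distribution; as a simplifying assumption the
# age distribution is represented only by its mean.
TOPICS = {
    "school": {"words": {"homework": 0.5, "exam": 0.4, "doctor": 0.1},
               "mean_age": 16.0},
    "health": {"words": {"homework": 0.1, "exam": 0.1, "doctor": 0.8},
               "mean_age": 60.0},
}


def topic_posterior(tokens, topics=TOPICS):
    """Return p(topic | tokens) by Bayes' rule with a uniform topic prior.

    Tokens outside a topic's toy vocabulary are skipped.
    """
    log_scores = {}
    for name, topic in topics.items():
        log_scores[name] = sum(
            math.log(topic["words"][t]) for t in tokens if t in topic["words"]
        )
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores.values())
    weights = {name: math.exp(s - peak) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def expected_age(tokens, topics=TOPICS):
    """Posterior-weighted average of the topics' mean ages."""
    posterior = topic_posterior(tokens, topics)
    return sum(posterior[name] * topics[name]["mean_age"] for name in posterior)


if __name__ == "__main__":
    print(expected_age(["homework", "exam"]))
    print(expected_age(["doctor"]))
```

In this toy setting, text dominated by "school" words yields a lower expected age than text dominated by "health" words, which is the intuition behind using age-specific topics for age prediction.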
