Analyzing the language of food on social media

We investigate the predictive power of the language of food on social media. We collect a corpus of over three million food-related posts from Twitter and demonstrate that many latent population characteristics can be predicted directly from this data: overweight rate, diabetes rate, political leaning, and the home geographical location of authors. For all tasks, our language-based models significantly outperform the majority-class baselines. Performance improves further with more complex natural language processing, such as topic modeling. We analyze which textual features have the greatest predictive power for these datasets, providing insight into the connections between the language of food, geographic locale, and community characteristics. Lastly, we design and implement an online system for real-time query and visualization of the dataset. Visualization tools, such as geo-referenced heatmaps and temporal histograms, allow us to discover more complex, global patterns mirrored in the language of food.
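
To make the comparison against a majority-class baseline concrete, the following is a minimal sketch, not the models used in this work. It assumes a binary label (for example, whether a population falls above or below a median overweight rate) and uses a small multinomial Naive Bayes classifier over bag-of-words features as a stand-in for a language-based model. The toy posts, the labels "high" and "low", and all function names are invented for illustration only.

```python
import math
from collections import Counter, defaultdict

# Toy, invented data: (tokens of a food-related post, hypothetical label).
train = [
    (["fried", "chicken", "gravy"], "high"),
    (["bacon", "burger", "fries"], "high"),
    (["donut", "fries", "soda"], "high"),
    (["kale", "salad", "quinoa"], "low"),
    (["salad", "smoothie", "tofu"], "low"),
]
test = [
    (["burger", "gravy"], "high"),
    (["tofu", "kale"], "low"),
]


def majority_baseline_accuracy(train_docs, test_docs):
    """Always predict the most frequent training label."""
    majority = Counter(label for _, label in train_docs).most_common(1)[0][0]
    hits = sum(1 for _, label in test_docs if label == majority)
    return hits / len(test_docs)


def train_nb(docs):
    """Collect per-label document counts, per-label word counts, and the vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def nb_accuracy(train_docs, test_docs):
    model = train_nb(train_docs)
    hits = sum(1 for tokens, label in test_docs if predict_nb(model, tokens) == label)
    return hits / len(test_docs)


baseline_acc = majority_baseline_accuracy(train, test)
model_acc = nb_accuracy(train, test)
print(f"majority baseline: {baseline_acc:.2f}, bag-of-words model: {model_acc:.2f}")
```

On real data, the same comparison would be run with held-out evaluation (for example, cross-validation) rather than a single toy split, and the bag-of-words features could be augmented with topic-model features.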
