Empath: Understanding Topic Signals in Large-Scale Text

Human language is colored by a broad range of topics, but existing text analysis tools focus on only a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (for example, "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.
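
The seed-expansion step described above can be illustrated with a minimal sketch. It is not Empath's implementation: the toy three-dimensional vectors, the `TOY_EMBEDDING` table, and the `expand_category` function are all hypothetical, and ranking candidates by cosine similarity to the centroid of the seed vectors is an assumed strategy chosen for illustration. The crowd-powered validation step is not modeled here.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# A real embedding would be learned from a large corpus.
TOY_EMBEDDING = {
    "bleed": [0.90, 0.10, 0.00],
    "punch": [0.80, 0.20, 0.10],
    "stab": [0.85, 0.15, 0.05],
    "wound": [0.90, 0.05, 0.10],
    "senate": [0.05, 0.90, 0.10],
    "vote": [0.10, 0.85, 0.05],
    "tweet": [0.05, 0.10, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_category(seeds, embedding, top_k=2):
    """Return the top_k non-seed words closest to the centroid of the seeds."""
    known = [embedding[s] for s in seeds if s in embedding]
    if not known:
        return []
    dim = len(known[0])
    # Average the seed vectors to get one point representing the category.
    centroid = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    # Score every non-seed word against that point.
    scored = [
        (word, cosine(centroid, vec))
        for word, vec in embedding.items()
        if word not in seeds
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_k]]


violence_terms = expand_category(["bleed", "punch"], TOY_EMBEDDING, top_k=2)
```

With these toy vectors, the seeds "bleed" and "punch" pull in the neighboring terms "stab" and "wound", while unrelated words such as "senate" and "tweet" rank far lower. In the pipeline the abstract describes, candidate terms like these would then pass through the crowd-powered filter before joining the category.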
