Predicting and Analyzing Language Specificity in Social Media Posts

In computational linguistics, specificity quantifies how much detail a text conveys. It is an important characteristic of speaker intention and language style, and it is useful in NLP applications such as summarization and argumentation mining. Yet to date, expert-annotated data for sentence-level specificity are scarce and confined to the news genre. In addition, existing systems that predict sentence specificity are classifiers trained to produce binary labels (general or specific). We collect a dataset of over 7,000 tweets annotated with specificity on a fine-grained scale. Using this dataset, we train a supervised regression model that accurately estimates specificity in social media posts, reaching a mean absolute error of 0.3578 (for ratings on a scale of 1 to 5) and a Pearson correlation of 0.73, significantly improving over baselines and previous sentence specificity prediction systems. We also present the first large-scale study revealing the social, temporal, and mental health factors underlying language specificity on social media.
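The two evaluation measures reported above, mean absolute error and Pearson correlation, can be illustrated with a minimal sketch. The function names and the `gold` and `pred` ratings below are hypothetical values chosen for illustration; they are not drawn from the dataset and do not reproduce the reported results.

```python
import math


def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted ratings."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations for each sequence.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# Hypothetical specificity ratings on the 1-5 scale (illustrative only).
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.0, 2.5, 4.5, 4.5]

mae = mean_absolute_error(gold, pred)
r = pearson_r(gold, pred)
print(f"MAE = {mae:.4f}, Pearson r = {r:.4f}")
```

Mean absolute error is expressed in the units of the rating scale, so it reads directly as an average deviation in rating points, while Pearson correlation measures how well the predictions track the relative ordering and spread of the gold ratings.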
