Discovering User Attribute Stylistic Differences via Paraphrasing

User attribute prediction from social media text has proven successful and useful for downstream tasks. Previous studies, however, have characterized differences in language use across user traits primarily through the presence or absence of words that indicate topical preferences. In this study, we aim to find distinctions in linguistic style across three user attributes: gender, age, and occupational class. By combining paraphrases with a simple yet effective method, we capture a wide set of stylistic differences that are free of topic bias. We show their predictive power in user profiling, their conformity with human perception and psycholinguistic hypotheses, and their potential use in generating natural language tailored to specific user traits.
