TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling

Personality profiling is the task of detecting personality traits of authors based on their writing style. Several personality typologies exist; however, the Myers-Briggs Type Indicator (MBTI) is particularly popular in the non-scientific community, and many people use it to analyse their own personality and discuss the results online. As a result, large amounts of self-assessed MBTI data are readily available on social media platforms such as Twitter. We present a novel corpus of tweets annotated with the MBTI personality type and gender of their author for six Western European languages (Dutch, German, French, Italian, Portuguese and Spanish). We outline the corpus creation and annotation, show statistics of the obtained data distributions, and present first baselines on Myers-Briggs personality profiling and gender prediction for all six languages.
