Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style

To our knowledge, we present the results of the first gender classification experiments on Slovene text. Inspired by the TwiSty corpus and experiments (Verhoeven et al., 2016), we use the Janes corpus (Erjavec et al., 2016) and its gender annotations to perform gender classification on Twitter text, comparing a token-based and a lemma-based approach. We find that the token-based approach (92.6% accuracy), which retains gender markings related to the author, outperforms the lemma-based approach by about 5%. Especially in the lemmatized version, we also observe stylistic and content-based differences in writing between men (e.g., more profane language, numerals and beer mentions) and women (e.g., more pronouns, emoticons and character flooding). Many of our findings corroborate previous research on other languages.
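The contrast between the two feature representations can be illustrated with a minimal sketch. It is not the authors' pipeline: the `TOY_LEMMAS` table, the helper names, and the two example sentences are assumptions made here for illustration, and a real system would use a trained lemmatizer and a classifier on top of these features. The sketch only shows why lemmatization can remove author-related gender marking such as Slovene participle and adjective endings.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for this sketch);
# a real pipeline would use a trained Slovene lemmatizer instead.
TOY_LEMMAS = {
    "sem": "biti",        # "am" (auxiliary) -> "to be"
    "bil": "biti",        # masculine past participle -> "to be"
    "bila": "biti",       # feminine past participle -> "to be"
    "utrujen": "utrujen",   # "tired", masculine form
    "utrujena": "utrujen",  # "tired", feminine form
}

def token_features(text):
    """Bag of lowercased surface tokens; gender-marked endings are kept."""
    return Counter(text.lower().split())

def lemma_features(text):
    """Bag of lemmas; unknown tokens fall back to their surface form."""
    return Counter(TOY_LEMMAS.get(tok, tok) for tok in text.lower().split())

# "Yesterday I was tired", as written by a male and a female author.
male_tweet = "Včeraj sem bil utrujen"
female_tweet = "Včeraj sem bila utrujena"

tokens_differ = token_features(male_tweet) != token_features(female_tweet)
lemmas_match = lemma_features(male_tweet) == lemma_features(female_tweet)
```

In this toy case the token features of the two sentences differ while their lemma features coincide, so a lemma-based classifier would have to rely on content and style rather than grammatical gender marking.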

[1] Tatiana Litvinova, et al. Profiling in Russian-Language Texts, 2016.

[2] Carla J. Groom, et al. Gender Differences in Language Use: An Analysis of 14,000 Text Samples, 2008.

[3] Jurgita Kapociute-Dzikiene, et al. Authorship Attribution and Author Profiling of Lithuanian Literary Texts, 2015, BSNLP@RANLP.

[4] Tomaž Erjavec, et al. JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin, 2016.

[5] J. Pennebaker. The Secret Life of Pronouns: What Our Words Say About Us, 2011.

[6] Shlomo Argamon, et al. Effects of Age and Gender on Blogging, 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[7] John D. Burger, et al. Discriminating Gender on Twitter, 2011, EMNLP.

[8] Tomaž Erjavec, et al. TweetCaT: a tool for building Twitter corpora of smaller languages, 2014, LREC.

[9] Walter Daelemans, et al. Predicting age and gender in online social networks, 2011, SMUC '11.

[10] Walter Daelemans, et al. TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling, 2016, LREC.

[11] Dong Nguyen, et al. TweetGenie: automatic age prediction from tweets, 2013, LINK.

[12] Benno Stein, et al. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations, 2016, CLEF.

[13] Tatiana Litvinova, et al. Machine Learning Models of Text Categorization by Author Gender Using Topic-independent Features, 2016.

[14] Nada Lavrac, et al. LemmaGen: Multilingual Lemmatisation with Induced Ripple-Down Rules, 2010, J. Univers. Comput. Sci.

[15] David Bamman, et al. Gender identity and lexical variation in social media, 2012, 1210.4567.

[16] Benno Stein, et al. Overview of the 3rd Author Profiling Task at PAN 2015, 2015, CLEF.

[17] Shlomo Argamon, et al. Automatically Categorizing Written Texts by Author Gender, 2002, Lit. Linguistic Comput.

[18] Tomaž Erjavec, et al. Normalising Slovene data: historical texts vs. user-generated content, 2016, KONVENS.

[19] Timothy Baldwin, et al. langid.py: An Off-the-shelf Language Identification Tool, 2012, ACL.

[20] Tomaž Erjavec, et al. Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene, 2016, LREC.

[21] Tatiana Litvinova, et al. Using Part-of-Speech Sequences Frequencies in a Text to Predict Author Personality: a Corpus Study, 2015.

[22] Margaret L. Kern, et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach, 2013, PloS one.

[23] Tomaž Erjavec, et al. Gold-Standard Datasets for Annotation of Slovene Computer-Mediated Communication, 2016, RASLAN.

[24] Jurgita Kapociute-Dzikiene, et al. Automatic Author Profiling of Lithuanian Parliamentary Speeches: Exploring the Influence of Features and Dataset Sizes, 2014, Baltic HLT.

[25] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[26] Paul Baker, et al. Using Corpora to Analyze Gender, 2014.

[27] Darja Fiser, et al. Private or Corporate? Predicting User Types on Twitter, 2016, NUT@COLING.

[28] Dirk Hovy, et al. Personality Traits on Twitter—or—How to Get 1,500 Personality Tests in a Week, 2015, WASSA@EMNLP.