Developing a Multilingual Annotated Corpus of Misogyny and Aggression

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla, built as part of the ComMA Project, which studies and automatically identifies misogyny and communalism on social media. The dataset is drawn from comments on YouTube videos and currently contains over 20,000 comments. Each comment is annotated at two levels: aggression (overtly aggressive, covertly aggressive, or non-aggressive) and misogyny (gendered or non-gendered). We describe the data collection process, the annotation tagset, and the issues and challenges encountered during annotation. Finally, we discuss the results of baseline experiments conducted to develop a classifier for misogyny in the three languages.
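The two-level tagset can be illustrated with a minimal sketch. The label values come from the description above; the class names, field names, and the `language` field are hypothetical and are not the project's actual data format.

```python
from dataclasses import dataclass
from enum import Enum


class Aggression(Enum):
    """Aggression level, using the three labels named in the abstract."""
    OVERT = "overtly aggressive"
    COVERT = "covertly aggressive"
    NONE = "non-aggressive"


class Misogyny(Enum):
    """Misogyny level, using the two labels named in the abstract."""
    GENDERED = "gendered"
    NON_GENDERED = "non-gendered"


@dataclass(frozen=True)
class AnnotatedComment:
    """One annotated comment (hypothetical record layout)."""
    text: str
    language: str  # assumed free-form tag, e.g. "Hindi"
    aggression: Aggression
    misogyny: Misogyny


def label_space():
    """Enumerate every (aggression, misogyny) pair the tagset allows."""
    return [(a, m) for a in Aggression for m in Misogyny]
```

Because the two levels are annotated independently, a comment can carry any of the six label combinations, for example covertly aggressive and gendered.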
