Paper Information - Datasets of Slovene and Croatian Moderated News Comments

This paper presents two large, newly constructed datasets of moderated news comments from two highly popular online news portals in their respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by manually annotating the types of content deleted by moderators and by investigating deletion trends among users and threads. Initial experiments on automatically detecting the deleted content are then presented. Both datasets are published in encrypted form, so that others can run experiments on detecting content to be deleted without potentially inappropriate content being revealed. Finally, the baseline classification models trained on the non-encrypted datasets are released as well, to enable real-world use.

Tomaž Erjavec | Darja Fišer | Nikola Ljubešić
