Paper Information - A Large Labeled Corpus for Online Harassment Research

A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. Our resulting dataset has roughly 15% positive harassment examples and 85% negative examples. This data is useful for training machine learning models, identifying textual and linguistic features of online harassment, and studying the nature of harassing comments and the culture of trolling.
