A Large Labeled Corpus for Online Harassment Research

A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. Our resulting dataset has roughly 15% positive harassment examples and 85% negative examples. This data is useful for training machine learning models, identifying textual and linguistic features of online harassment, and for studying the nature of harassing comments and the culture of trolling.