Towards generalisable hate speech detection: a review on obstacles and solutions

Hate speech is a type of harmful online content that directly attacks or promotes hate towards a group or an individual member based on actual or perceived aspects of their identity, such as ethnicity, religion, and sexual orientation. With online hate speech on the rise, its automatic detection as a natural language processing task is attracting growing interest. However, only recently has it been shown that existing models generalise poorly to unseen data. This survey summarises how generalisable existing hate speech detection models are, explains why they struggle to generalise, reviews existing attempts to address the main obstacles, and proposes directions for future research to improve generalisation in hate speech detection.
