All You Need is "Love": Evading Hate Speech Detection

With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective - a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and using character-level features makes the models systematically more attack-resistant than using word-level features.

[1]  Jon Andoni Duñabeitia,et al.  R34D1NG W0RD5 W1TH NUMB3R5. , 2008, Journal of experimental psychology. Human perception and performance.

[2]  Lucas Dixon,et al.  Ex Machina: Personal Attacks Seen at Scale , 2016, WWW.

[3]  Njagi Dennis Gitari,et al.  A Lexicon-based Approach for Hate Speech Detection , 2015, MUE 2015.

[4]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[5]  Ying Chen,et al.  Detecting Offensive Language in Social Media to Protect Adolescent Online Safety , 2012, 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Confernece on Social Computing.

[6]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[7]  Yoav Goldberg,et al.  A Primer on Neural Network Models for Natural Language Processing , 2015, J. Artif. Intell. Res..

[8]  Sarah J. White,et al.  Raeding Wrods With Jubmled Lettres , 2006, Psychological science.

[9]  Rachel Greenstadt,et al.  Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity , 2012, TSEC.

[10]  Henry Lieberman,et al.  Common Sense Reasoning for Detection, Prevention, and Mitigation of Cyberbullying , 2012, TIIS.

[11]  Vasudeva Varma,et al.  Deep Learning for Hate Speech Detection in Tweets , 2017, WWW.

[12]  Erez Lieberman Aiden,et al.  Quantitative Analysis of Culture Using Millions of Digitized Books , 2010, Science.

[13]  Yan Zhou,et al.  Combating Good Word Attacks on Statistical Spam Filters with Multiple Instance Learning , 2007 .

[14]  Julia Hirschberg,et al.  Detecting Hate Speech on the World Wide Web , 2012 .

[15]  Joel R. Tetreault,et al.  Abusive Language Detection in Online User Content , 2016, WWW.

[16]  GoldbergYoav A primer on neural network models for natural language processing , 2016 .

[17]  Yan Zhou,et al.  Combating Good Word Attacks on Statistical Spam Filters with Multiple Instance Learning , 2007, 19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007).

[18]  David Robinson,et al.  Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network , 2018, ESWC.

[19]  Joel R. Tetreault,et al.  Do Characters Abuse More Than Words? , 2016, SIGDIAL Conference.

[20]  Rachel Greenstadt,et al.  Practical Attacks Against Authorship Recognition Techniques , 2009, IAAI.

[21]  Michael Wiegand,et al.  A Survey on Hate Speech Detection using Natural Language Processing , 2017, SocialNLP@EACL.

[22]  Christopher Meek,et al.  Good Word Attacks on Statistical Spam Filters , 2005, CEAS.

[23]  Radha Poovendran,et al.  Deceiving Google's Perspective API Built for Detecting Toxic Comments , 2017, ArXiv.

[24]  Ingmar Weber,et al.  Automated Hate Speech Detection and the Problem of Offensive Language , 2017, ICWSM.

[25]  Dirk Hovy,et al.  Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter , 2016, NAACL.

[26]  Neil Shah,et al.  False Information on Web and Social Media: A Survey , 2018, ArXiv.

[27]  Mangal Sain,et al.  Survey on malware evasion techniques: State of the art and challenges , 2012, 2012 14th International Conference on Advanced Communication Technology (ICACT).

[28]  Richard Socher,et al.  Pointer Sentinel Mixture Models , 2016, ICLR.

[29]  Sebastian Ruder,et al.  Fine-tuned Language Models for Text Classification , 2018, ArXiv.

[30]  Alexander Brown,et al.  What is hate speech? Part 1: The Myth of Hate , 2017 .

[31]  Matthew Leighton Williams,et al.  Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making , 2015 .