Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases

Warning: this paper contains content that may be offensive or upsetting.

Detecting social bias in text is challenging due to nuance, subjectivity, and the difficulty of obtaining good-quality labeled datasets at scale, especially given the evolving nature of social biases and society. To address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (LMs). We select a few label-balanced exemplars from a small support repository that are closest, in embedding space, to the query to be labeled. We then provide the LM with an instruction that consists of this subset of labeled exemplars, the query text to be classified, and a definition of bias, and prompt it to make a decision. We demonstrate that large LMs used in a few-shot context can detect different types of fine-grained biases with accuracy similar, and sometimes superior, to that of fine-tuned models. We observe that the largest 530B-parameter model is significantly more effective at detecting social bias than smaller models (achieving at least a 20% improvement in AUC over the other models). It also maintains a high AUC (dropping less than 5%) in a few-shot setting with a labeled repository reduced to as few as 100 samples. Large pretrained language models thus make it easier and quicker to build new bias detectors.
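The selection-and-prompting pipeline described above can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the toy embed() encoder, the label names, the prompt wording, and the bias definition are all placeholders introduced here for illustration.

```python
# Minimal sketch of label-balanced nearest-neighbor exemplar selection and
# instruction-prompt assembly. embed(), the labels, and the prompt wording
# are illustrative assumptions, not the paper's exact implementation.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy deterministic stand-in for a real sentence encoder."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def select_exemplars(query: str, support: list[tuple[str, str]], k: int = 4):
    """Pick the k support examples closest to the query in embedding space,
    balanced across the two labels (k // 2 from each)."""
    q = embed(query)
    # Rank support examples by cosine similarity to the query (vectors are
    # unit-normalized, so the dot product is the cosine similarity).
    ranked = sorted(support, key=lambda ex: -float(q @ embed(ex[0])))
    picked, counts = [], {"biased": 0, "not biased": 0}
    for text, label in ranked:
        if counts[label] < k // 2:
            picked.append((text, label))
            counts[label] += 1
    return picked

def build_prompt(query: str, exemplars, definition: str) -> str:
    """Assemble the instruction: bias definition, labeled exemplars, query."""
    parts = [definition, ""]
    for text, label in exemplars:
        parts.append(f'Text: "{text}"\nAnswer: {label}\n')
    parts.append(f'Text: "{query}"\nAnswer:')
    return "\n".join(parts)

support = [  # a tiny labeled support repository
    ("Women are bad drivers.", "biased"),
    ("The bus was late again today.", "not biased"),
    ("People from that country are all lazy.", "biased"),
    ("I enjoyed the concert last night.", "not biased"),
]
definition = ("A statement is biased if it expresses prejudice against a "
              "person or group based on a social attribute.")
query = "They never hire people like us."
prompt = build_prompt(query, select_exemplars(query, support), definition)
print(prompt)  # this prompt is then sent to the pretrained LM for its verdict
```

In practice the toy embed() would be replaced by a real sentence encoder, and the LM's completion after "Answer:" would be read off as the predicted label.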
