Recipes for Safety in Open-domain Chatbots

Models trained on large unlabeled corpora of human interactions will learn the patterns and mimic the behaviors therein, including offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and evaluating them, as well as a novel method to distill safety considerations into generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find that our new techniques are (i) safer than existing models, as measured by automatic and human evaluations, while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our models.
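To make the contrast concrete, the deployment-time alternative to distilled safety is a two-stage pipeline: a generative model proposes candidate replies and an external safety classifier filters them, falling back to a canned response when nothing passes. The sketch below illustrates that pipeline with toy stand-ins; the blocklist heuristic, candidate list, and fallback string are all hypothetical and do not reflect the paper's actual models.

```python
# Sketch of the external-classifier deployment strategy the abstract contrasts
# with "baked-in" safety. Every component here is a toy stand-in.

UNSAFE_WORDS = {"idiot", "hate"}  # hypothetical blocklist standing in for a trained classifier


def safety_classifier(text: str) -> bool:
    """Return True if the text is judged safe (toy heuristic, not a real model)."""
    return not any(word in text.lower() for word in UNSAFE_WORDS)


def generate_candidates(prompt: str) -> list:
    """Stand-in for beam-search candidates from a generative dialogue model."""
    return ["You are an idiot.", "That sounds interesting, tell me more!"]


def respond_with_filter(prompt: str,
                        fallback: str = "Hey, do you want to talk about something else?") -> str:
    """Two-stage pipeline: keep the first candidate the classifier accepts."""
    for candidate in generate_candidates(prompt):
        if safety_classifier(candidate):
            return candidate
    return fallback  # canned reply when every candidate is flagged unsafe
```

A model with safety distilled into it would skip the classifier call entirely, which removes the extra inference cost and the failure mode where the fallback derails the conversation.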
