Improving alignment of dialogue agents via targeted human judgements

Amelia Glaese*, Nat McAleese*, Maja Trebacz*, John Aslanides*, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks and Geoffrey Irving

*Equal contributions; all authors are affiliated with DeepMind.
