WebGPT: Browser-assisted question-answering with human feedback

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
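
The rejection-sampling step described above can be pictured as a best-of-n loop: draw several candidate answers from the behavior-cloned policy, score each with the reward model trained on human preference comparisons, and return the top-scoring one. The following is a minimal sketch under that reading; sample_answer and reward_score are hypothetical stand-ins, not APIs from the paper.

    import random

    def sample_answer(question: str) -> str:
        # Stand-in for sampling one answer from the fine-tuned policy.
        return f"candidate answer {random.randint(0, 999)} to: {question}"

    def reward_score(question: str, answer: str) -> float:
        # Stand-in for the reward model's scalar preference score.
        return random.random()

    def best_of_n(question: str, n: int = 64) -> str:
        """Draw n candidates and keep the one the reward model prefers."""
        candidates = [sample_answer(question) for _ in range(n)]
        return max(candidates, key=lambda a: reward_score(question, a))

    if __name__ == "__main__":
        print(best_of_n("Why is the sky blue?", n=4))

Only inference-time compute grows with n here; the policy itself is unchanged, which is what makes rejection sampling a simple alternative to further reinforcement-learning fine-tuning.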
