Diving Deep into Modes of Fact Hallucinations in Dialogue Systems

Knowledge Graph (KG)-grounded conversational systems often rely on large pre-trained models and frequently suffer from fact hallucination: entities with no support in the knowledge source or the conversation history are introduced into responses, disrupting the flow of the conversation. Existing work attempts to overcome this issue by tweaking the training procedure or applying multi-step refinement, but little effort has gone into building an entity-level hallucination detection system, which would provide fine-grained signals for controlling fallacious content during response generation. As a first step toward addressing this issue, we dive deep to identify various modes of hallucination in KG-grounded chatbots through human feedback analysis. Second, we propose a series of perturbation strategies to create a synthetic dataset named FADE (FActual Dialogue Hallucination DEtection Dataset). Finally, we conduct comprehensive data analyses and build multiple baseline models for hallucination detection, comparing them against human-verified data and already established benchmarks.