RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling

Retrieval-augmented language models show promise in addressing issues such as outdated information and hallucinations in language models (LMs). However, current research faces two main problems: (1) determining what information to retrieve, and (2) effectively combining the retrieved information during generation. We argue that valuable retrieved information should not only be related to the current source text but should also account for the future target text, since LMs by nature model future tokens. Moreover, we propose that aggregation using latent variables drawn from a compact latent space is more efficient than using explicit raw text, which is limited by context length and susceptible to noise. We therefore introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE). It encodes the text corpus into a latent space, capturing both current and future information from the source and target text. Additionally, we leverage the VAE to initialize the latent space and adopt the probabilistic form of the retrieval-generation paradigm by expanding the Gaussian prior distribution into a Gaussian mixture distribution. A theoretical analysis provides an optimizable upper bound for RegaVAE. Experimental results on several datasets demonstrate significant improvements in text generation quality and hallucination reduction.
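To make the retrieve-then-aggregate idea concrete, the sketch below illustrates one plausible reading of the pipeline: corpus passages are pre-encoded into latent vectors, the latents nearest to the current source are retrieved, and they parameterize a Gaussian mixture from which the decoder's latent variable is sampled. This is a minimal illustration only; the cosine-similarity retriever, the similarity-softmax mixture weights, and the fixed per-component variance are assumptions made for the sketch, not details specified by the abstract.

```python
import torch
import torch.nn.functional as F

def retrieve_latents(query_z, corpus_z, k=4):
    """Return the k corpus latents closest to the query latent.

    Cosine similarity stands in for the retriever here (an assumption);
    the softmaxed similarities serve as mixture weights.
    """
    sims = F.cosine_similarity(query_z.unsqueeze(0), corpus_z, dim=-1)
    weights, idx = sims.topk(k)
    return corpus_z[idx], torch.softmax(weights, dim=-1)

def gaussian_mixture_sample(means, weights, sigma=1.0):
    """Sample from a Gaussian mixture whose components are centred on
    the retrieved latents, weighted by retrieval similarity.
    sigma is a fixed illustrative variance, not a learned quantity."""
    comp = torch.multinomial(weights, 1).item()  # pick a component
    return means[comp] + sigma * torch.randn_like(means[comp])

# Toy usage: a 16-dim latent space with a 100-entry corpus index.
corpus_z = torch.randn(100, 16)   # latents of pre-encoded corpus passages
query_z = torch.randn(16)         # latent of the current source text
means, weights = retrieve_latents(query_z, corpus_z, k=4)
z = gaussian_mixture_sample(means, weights)  # latent fed to the decoder
```

Sampling from a mixture centred on retrieved latents lets the decoder condition on neighbours without concatenating raw retrieved text into the prompt, which is the efficiency argument the abstract makes for latent-space aggregation over explicit raw text.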
