Situated Real-time Interaction with a Virtually Embodied Avatar

A well-known shortcoming of generative language models is that they can generate language which, despite being syntactically and semantically sound, is not grounded in facts [2, 17]. A growing body of recent work has shown how combining language models (LMs) with external information sources makes it possible to reduce such hallucinations by letting the model attend directly to external information [15, 18, 24], an approach commonly referred to as grounding. One common grounding strategy is to monitor a model's output for certain syntactic patterns (such as the presence of agreed-upon tags) and to let an external source fill in the requested information in the LM's stead, after which the LM continues its generation [5, 15, 17, 19]. A similar approach can let an LM-based agent interact with an external environment, by monitoring for generated actions and executing them in that environment [4, 20]. These interactions between the LM and external systems are typically enabled by augmenting the LM with external plug-and-play modules, together with orchestrators that coordinate LM prompts and generation [17]. This kind of external environment grounding is applica-
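The monitor-and-fill pattern described above can be sketched as a minimal orchestration loop. This is an illustrative sketch only: the `[LOOKUP: …]` tag format, the `lookup` knowledge source, and the `generate` model stub are all hypothetical placeholders, not the interface of any particular system cited here.

```python
import re

# Hypothetical agreed-upon tag the LM emits when it needs external facts.
TOOL_TAG = re.compile(r"\[LOOKUP:\s*(?P<query>[^\]]+)\]")

def lookup(query: str) -> str:
    """Stand-in for an external knowledge source (search index, database, API)."""
    facts = {"capital of France": "Paris"}
    return facts.get(query, "unknown")

def generate(prompt: str) -> str:
    """Stand-in for an LM: emits a lookup tag when it lacks a fact,
    and continues fluently once the fact appears in its context."""
    if "Paris" in prompt:
        return prompt + " It is known for the Eiffel Tower."
    return prompt + " The capital of France is [LOOKUP: capital of France]."

def orchestrate(prompt: str, max_rounds: int = 4) -> str:
    """Monitor LM output for tags; substitute external results and resume generation."""
    text = generate(prompt)
    for _ in range(max_rounds):
        match = TOOL_TAG.search(text)
        if match is None:
            return text  # no pending tool calls; generation is grounded
        # Replace the tag with the external source's answer, then let the
        # model continue from the now-grounded context.
        text = text[:match.start()] + lookup(match.group("query")) + text[match.end():]
        text = generate(text)
    return text
```

The same loop supports environment interaction: instead of substituting retrieved text, the orchestrator would execute the matched action in the environment and feed the resulting observation back into the model's context.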

[1] R. Memisevic, et al. Is end-to-end learning enough for fitness activity recognition?, 2023, arXiv.

[2] Yi Wang, et al. VideoChat: Chat-Centric Video Understanding, 2023, arXiv.

[3] Oskar van der Wal, et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, 2023, ICML.

[4] Li Dong, et al. Language Is Not All You Need: Aligning Perception with Language Models, 2023, NeurIPS.

[5] Michel Galley, et al. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback, 2023, arXiv.

[6] Dan Su, et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, 2023, IJCNLP.

[7] Y. Shoham, et al. In-Context Retrieval-Augmented Language Models, 2023, Transactions of the Association for Computational Linguistics.

[8] Anna A. Ivanova, et al. Dissociating language and thought in large language models: a cognitive perspective, 2023, arXiv.

[9] Sergio Gomez Colmenarejo, et al. A Generalist Agent, 2022, Transactions on Machine Learning Research.

[10] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[11] Renelito Delos Santos, et al. LaMDA: Language Models for Dialog Applications, 2022, arXiv.

[12] Jeff Wu, et al. WebGPT: Browser-assisted question-answering with human feedback, 2021, arXiv.

[13] Xuedong Huang, et al. Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention, 2021, IJCAI.

[14] Gabriel Recchia, et al. Teaching Autoregressive Language Models Complex Tasks By Demonstration, 2021, arXiv.

[15] Pieter Abbeel, et al. Decision Transformer: Reinforcement Learning via Sequence Modeling, 2021, NeurIPS.

[16] Yue Zhao, et al. FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding, 2020, CVPR.

[17] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[18] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.

[19] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.

[20] Lei Zheng, et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.

[21] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[22] Jing Yu Koh, et al. Grounding Language Models to Images for Multimodal Generation, 2023, arXiv.

[23] Xu Tan, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, 2023, NeurIPS.

[24] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.