AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face out of building blocks. Such tasks often involve complex multi-step reasoning and present significant challenges because of the limited paired data connecting human instructions (e.g., "make a smiley face") and robot actions (e.g., end-effector movements). Existing approaches mitigate this challenge with an open-loop paradigm that decomposes high-level instructions into simple sub-task plans and executes them step by step with low-level control models. However, these approaches lack instant observations during multi-step reasoning, which leads to sub-optimal results. To address this issue, we propose to automatically collect a cognitive robot dataset with Large Language Models (LLMs). The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observation sequences. To enable efficient data acquisition, we employ carefully designed multi-round prompts that greatly reduce the need for human involvement. We further propose a closed-loop multi-modal embodied planning model that autoregressively generates plans from image observations. To facilitate effective learning, we build on MiniGPT-4 with a frozen visual encoder and LLM, and finetune an additional vision adapter and Q-Former to enable the fine-grained spatial perception required for manipulation tasks. Experiments verify the superiority of our framework over existing open- and closed-loop methods, with success-rate improvements of 21.4% and 14.5% over ChatGPT- and GPT-4-based robot planners, respectively. Real-world demos are shown at https://www.youtube.com/watch?v=ayAzID1_qQk .
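The abstract does not spell out the multi-round prompt design used for data collection; as a rough illustration of how an LLM-driven collection loop of this kind can work, the Python sketch below decomposes an instruction into sub-task plans and lets the model critique its own output before accepting it. The function `collect_task_plans`, the prompt wording, and the self-critique stopping rule are illustrative assumptions, not the authors' actual pipeline; `llm` stands for any text-completion callable (e.g., a wrapper around a ChatGPT API).

```python
from typing import Callable, List

def collect_task_plans(
    llm: Callable[[str], str],
    task_instruction: str,
    scene_description: str,
    max_rounds: int = 5,
) -> List[str]:
    """Hypothetical multi-round prompting loop: ask an LLM to decompose a
    high-level instruction (e.g., "make a smiley face with blocks") into
    step-by-step sub-task plans, then let it refine the plan over rounds."""
    prompt = (
        f"Scene: {scene_description}\n"
        f"Instruction: {task_instruction}\n"
        "Decompose the instruction into numbered low-level sub-tasks "
        "(one pick-and-place step per line)."
    )
    plan: List[str] = []
    for _ in range(max_rounds):
        response = llm(prompt)
        plan = [line.strip() for line in response.splitlines() if line.strip()]
        # Ask the LLM to review its own plan; stop once it reports no issues.
        critique = llm(
            "Review the following plan for missing or infeasible steps. "
            "Answer 'OK' if it is executable, otherwise list the problems:\n"
            + "\n".join(plan)
        )
        if critique.strip().upper().startswith("OK"):
            break
        prompt += (
            f"\nPrevious plan:\n{response}\nIssues:\n{critique}\nRevise the plan."
        )
    return plan
```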

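The planning model is likewise described only at a high level: a frozen visual encoder and frozen LLM, with a trainable vision adapter and Q-Former-style module, generating plans autoregressively from image observations. The PyTorch sketch below captures that parameter-freezing pattern in a generic form under stated assumptions; the class and module names (`ClosedLoopPlanner`, `adapter`, `cross_attn`) are hypothetical placeholders, whereas the real model builds on MiniGPT-4 components.

```python
import torch
import torch.nn as nn

class ClosedLoopPlanner(nn.Module):
    """Sketch of a closed-loop embodied planner: frozen visual encoder and
    frozen language model, with only a small vision adapter and a query-based
    cross-attention block (Q-Former-style) left trainable."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096, num_queries: int = 32):
        super().__init__()
        self.vision_encoder = vision_encoder.eval().requires_grad_(False)  # frozen
        self.language_model = language_model.eval().requires_grad_(False)  # frozen
        # Trainable pieces: adapter + learned queries + cross-attention block.
        self.adapter = nn.Linear(vis_dim, llm_dim)
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def encode_observation(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.vision_encoder(image)            # (B, N, vis_dim)
        patches = self.adapter(patches)                      # (B, N, llm_dim)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        visual_tokens, _ = self.cross_attn(q, patches, patches)
        return visual_tokens                                 # (B, num_queries, llm_dim)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Prefix the instruction/history embeddings with visual tokens and let the
        # frozen LLM generate the next sub-task plan. Placeholder call: assumes an
        # LM interface that accepts precomputed input embeddings.
        prefix = torch.cat([self.encode_observation(image), text_embeds], dim=1)
        return self.language_model(prefix)
```

At inference time such a planner would run closed-loop: after a low-level controller executes each generated sub-task, a fresh observation is encoded and fed back in before the next plan step is generated.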