AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face out of building blocks. Such tasks often involve complex multi-step reasoning and present significant challenges because of the limited paired data connecting human instructions (e.g., "make a smiley face") and robot actions (e.g., end-effector movements). Existing approaches mitigate this challenge with an open-loop paradigm that decomposes high-level instructions into simple sub-task plans and executes them step by step with low-level control models. However, these approaches lack instant observations during multi-step reasoning, which leads to sub-optimal results. To address this issue, we propose to automatically collect a cognitive robot dataset with Large Language Models (LLMs). The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observation sequences. To enable efficient data acquisition, we employ carefully designed multi-round prompts that greatly reduce the need for human involvement. We further propose a closed-loop multi-modal embodied planning model that autoregressively generates plans from image observations. To facilitate effective learning, we build on MiniGPT-4 with a frozen visual encoder and LLM, and finetune an additional vision adapter and Q-Former to enable the fine-grained spatial perception required for manipulation tasks. Experiments verify the superiority of our framework over existing open- and closed-loop methods, with success-rate improvements of 21.4% and 14.5% over ChatGPT- and GPT-4-based robot planners, respectively. Real-world demos are shown at https://www.youtube.com/watch?v=ayAzID1_qQk .
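The abstract does not spell out the multi-round prompt design used for data collection; as a rough illustration of how an LLM-driven collection loop of this kind can work, the Python sketch below decomposes an instruction into sub-task plans and lets the model critique its own output before accepting it. The function `collect_task_plans`, the prompt wording, and the self-critique stopping rule are illustrative assumptions, not the authors' actual pipeline; `llm` stands for any text-completion callable (e.g., a wrapper around a ChatGPT API).

```python
from typing import Callable, List

def collect_task_plans(
    llm: Callable[[str], str],
    task_instruction: str,
    scene_description: str,
    max_rounds: int = 5,
) -> List[str]:
    """Hypothetical multi-round prompting loop: ask an LLM to decompose a
    high-level instruction (e.g., "make a smiley face with blocks") into
    step-by-step sub-task plans, then let it refine the plan over rounds."""
    prompt = (
        f"Scene: {scene_description}\n"
        f"Instruction: {task_instruction}\n"
        "Decompose the instruction into numbered low-level sub-tasks "
        "(one pick-and-place step per line)."
    )
    plan: List[str] = []
    for _ in range(max_rounds):
        response = llm(prompt)
        plan = [line.strip() for line in response.splitlines() if line.strip()]
        # Ask the LLM to review its own plan; stop once it reports no issues.
        critique = llm(
            "Review the following plan for missing or infeasible steps. "
            "Answer 'OK' if it is executable, otherwise list the problems:\n"
            + "\n".join(plan)
        )
        if critique.strip().upper().startswith("OK"):
            break
        prompt += (
            f"\nPrevious plan:\n{response}\nIssues:\n{critique}\nRevise the plan."
        )
    return plan
```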

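The planning model is likewise described only at a high level: a frozen visual encoder and frozen LLM, with a trainable vision adapter and Q-Former-style module, generating plans autoregressively from image observations. The PyTorch sketch below captures that parameter-freezing pattern in a generic form under stated assumptions; the class and module names (`ClosedLoopPlanner`, `adapter`, `cross_attn`) are hypothetical placeholders, whereas the real model builds on MiniGPT-4 components.

```python
import torch
import torch.nn as nn

class ClosedLoopPlanner(nn.Module):
    """Sketch of a closed-loop embodied planner: frozen visual encoder and
    frozen language model, with only a small vision adapter and a query-based
    cross-attention block (Q-Former-style) left trainable."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096, num_queries: int = 32):
        super().__init__()
        self.vision_encoder = vision_encoder.eval().requires_grad_(False)  # frozen
        self.language_model = language_model.eval().requires_grad_(False)  # frozen
        # Trainable pieces: adapter + learned queries + cross-attention block.
        self.adapter = nn.Linear(vis_dim, llm_dim)
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def encode_observation(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.vision_encoder(image)            # (B, N, vis_dim)
        patches = self.adapter(patches)                      # (B, N, llm_dim)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        visual_tokens, _ = self.cross_attn(q, patches, patches)
        return visual_tokens                                 # (B, num_queries, llm_dim)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Prefix the instruction/history embeddings with visual tokens and let the
        # frozen LLM generate the next sub-task plan. Placeholder call: assumes an
        # LM interface that accepts precomputed input embeddings.
        prefix = torch.cat([self.encode_observation(image), text_embeds], dim=1)
        return self.language_model(prefix)
```

At inference time such a planner would run closed-loop: after a low-level controller executes each generated sub-task, a fresh observation is encoded and fed back in before the next plan step is generated.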