Talk2Car: Taking Control of Your Self-Driving Car

A long-term goal of artificial intelligence is to have an agent execute commands communicated through natural language. In many cases the commands are grounded in a visual environment shared by the agent and the human who gives the command. Executing such a command requires first mapping it onto the physical visual space, after which the appropriate action can be taken. In this paper we focus on this first step, and more specifically we consider it in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. Our work presents the Talk2Car dataset, the first object referral dataset that contains commands written in natural language for self-driving cars. We provide a detailed comparison with related datasets such as ReferIt, RefCOCO, RefCOCO+, RefCOCOg, Cityscapes-Ref and CLEVR-Ref. Additionally, we include a performance analysis using strong state-of-the-art models. The results show that the proposed object referral task is a challenging one: the models achieve promising results, but further research is required in natural language processing, computer vision and the intersection of these fields. The dataset can be found on our website: this http URL
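
To make the task concrete, below is a minimal sketch of what an object referral sample and its evaluation could look like, assuming a standard intersection-over-union (IoU) criterion for deciding whether a predicted bounding box matches the referred object; the field names (ReferralSample, referred_box, etc.) are hypothetical and not the actual Talk2Car data format.

```python
# Sketch of the object-referral task: given an image and a natural-language
# command, a model must localize the object the command refers to.
# Predictions are typically scored with an IoU >= 0.5 criterion.

from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

@dataclass
class ReferralSample:            # hypothetical container, not the official format
    image_path: str              # front-camera frame of the street scene
    command: str                 # e.g. "Pull up behind the red truck on the right"
    referred_box: Box            # ground-truth box of the object the command refers to

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(predicted: Box, sample: ReferralSample, threshold: float = 0.5) -> bool:
    """A prediction counts as correct if it sufficiently overlaps the ground truth."""
    return iou(predicted, sample.referred_box) >= threshold
```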

[1] Stefanie Tellex et al., Clarifying Commands with Information-Theoretic Human-Robot Dialog, HRI, 2013.

[2] Yuandong Tian et al., Simple Baseline for Visual Question Answering, arXiv, 2015.

[3] Trevor Darrell et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP, 2016.

[4] Louis-Philippe Morency et al., Visual Referring Expression Recognition: What Do Systems Actually Learn?, NAACL, 2018.

[5] Ali Farhadi et al., IQA: Visual Question Answering in Interactive Environments, CVPR, 2018.

[6] Albert S. Huang et al., Generalized Grounding Graphs: A Probabilistic Framework for Understanding Grounded Commands, arXiv, 2017.

[7] Trevor Darrell et al., Natural Language Object Retrieval, CVPR, 2016.

[8] Justin Johnson et al., DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer, arXiv, 2018.

[9] Sebastian Ramos et al., The Cityscapes Dataset for Semantic Urban Scene Understanding, CVPR, 2016.

[10] Li Fei-Fei et al., Inferring and Executing Programs for Visual Reasoning, ICCV, 2017.

[11] Pietro Perona et al., Microsoft COCO: Common Objects in Context, ECCV, 2014.

[12] Stefan Lee et al., Embodied Question Answering, CVPR Workshops, 2018.

[13] Ming-Wei Chang et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, 2019.

[14] Mohit Shridhar et al., Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction, Robotics: Science and Systems, 2018.

[15] Mark Johnson et al., An Improved Non-monotonic Transition System for Dependency Parsing, EMNLP, 2015.

[16] Trevor Darrell et al., Modeling Relationships in Referential Expressions with Compositional Modular Networks, CVPR, 2017.

[17] Armand Joulin et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS, 2014.

[18] Christopher D. Manning et al., Compositional Attention Networks for Machine Reasoning, ICLR, 2018.

[19] Alan L. Yuille et al., Generation and Comprehension of Unambiguous Object Descriptions, CVPR, 2016.

[20] Trevor Darrell et al., Explainable Neural Computation via Stack Neural Module Networks, ECCV, 2018.

[21] Demis Hassabis et al., Grounded Language Learning in a Simulated 3D World, arXiv, 2017.

[22] Li Fei-Fei et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR, 2017.

[23] Lukasz Kaiser et al., Attention Is All You Need, NIPS, 2017.

[24] Licheng Yu et al., MAttNet: Modular Attention Network for Referring Expression Comprehension, CVPR, 2018.

[25] Andreas Geiger et al., Vision Meets Robotics: The KITTI Dataset, International Journal of Robotics Research, 2013.

[26] Wei Liu et al., SSD: Single Shot MultiBox Detector, ECCV, 2016.

[27] Vicente Ordonez et al., ReferItGame: Referring to Objects in Photographs of Natural Scenes, EMNLP, 2014.

[28] Luc Van Gool et al., Object Referring in Videos with Language and Human Gaze, CVPR, 2018.

[29] Yoav Artzi et al., TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, CVPR, 2019.

[30] Jeffrey Pennington et al., GloVe: Global Vectors for Word Representation, EMNLP, 2014.

[31] Luke S. Zettlemoyer et al., Deep Contextualized Word Representations, NAACL, 2018.

[32] Matthew R. Walter et al., Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation, AAAI, 2011.

[33] Qi Wu et al., Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, CVPR, 2018.

[34] Albert S. Huang et al., Generalized Grounding Graphs: A Probabilistic Framework for Understanding Grounded Language, 2013.

[35] Licheng Yu et al., Modeling Context in Referring Expressions, ECCV, 2016.

[36] Jason Weston et al., Talk the Walk: Navigating New York City through Grounded Dialogue, arXiv, 2018.

[37] Dengxin Dai et al., Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory, International Journal of Computer Vision, 2019.

[38] Qiang Xu et al., nuScenes: A Multimodal Dataset for Autonomous Driving, CVPR, 2020.