A Multiview Approach to Learning Articulated Motion Models

In order for robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common in environments built for and by humans. Kinematic models provide a concise representation of these objects that enables deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely upon visual observations of an object’s motion and are therefore susceptible to occlusions and feature sparsity. Natural language descriptions offer a flexible and efficient means for humans to convey complementary information in a weakly supervised manner suited to a variety of interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model the linguistic information with a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multi-part objects on which the previous state-of-the-art, vision-only system fails. We evaluate our multimodal learning framework on a dataset comprising a variety of household objects, and demonstrate a 23% improvement in model accuracy over the vision-only baseline.
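To make the idea of complementary vision and language evidence concrete, the following minimal Python sketch (not the authors' implementation; the function name, class set, and numeric scores are all hypothetical) fuses per-class log-likelihoods from a visual motion estimate and a grounded language description to select a kinematic model class for one object part, under an assumed conditional independence of the two observation streams given the model class.

```python
# Minimal sketch: fusing vision- and language-derived evidence to pick a
# kinematic model class. Hypothetical names and scores, for illustration only.
import numpy as np

MODEL_CLASSES = ["rigid", "prismatic", "revolute"]

def fuse_model_evidence(vision_log_lik, language_log_lik, log_prior=None):
    """Return the MAP model class and the (unnormalized) log-posterior.

    vision_log_lik, language_log_lik: dicts mapping class name -> log-likelihood.
    log_prior: optional dict of log-priors over classes (uniform if None).
    """
    if log_prior is None:
        log_prior = {c: np.log(1.0 / len(MODEL_CLASSES)) for c in MODEL_CLASSES}
    # Assuming the vision and language observations are conditionally
    # independent given the model class, their log-likelihoods add.
    log_post = {c: vision_log_lik[c] + language_log_lik[c] + log_prior[c]
                for c in MODEL_CLASSES}
    return max(log_post, key=log_post.get), log_post

# Example: sparse visual features leave "prismatic" vs. "revolute" ambiguous,
# but a description like "the door swings open" strongly favors "revolute".
vision = {"rigid": -5.0, "prismatic": -1.2, "revolute": -1.1}
language = {"rigid": -6.0, "prismatic": -3.5, "revolute": -0.4}
best, scores = fuse_model_evidence(vision, language)
print(best)  # -> "revolute"
```

The sketch captures only the fusion step; in the paper's setting the language term would come from the probabilistic graphical model that grounds the description to its referent motion, and the vision term from the estimated motion of tracked features in the RGB-D stream.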
