Learning Collaborative Action Plans from YouTube Videos

Videos from the World Wide Web provide a rich source of information that robots could use to acquire knowledge about manipulation tasks. Previous work has focused on generating action sequences from unconstrained videos for a single robot performing manipulation tasks by itself. However, robots operating in the same physical space as people not only need to perform actions autonomously, but also to coordinate seamlessly with their human counterparts. This often requires representing and executing collaborative manipulation actions, such as handing over a tool or holding an object for the other agent. We present a system for knowledge acquisition of collaborative manipulation action plans that outputs commands to the robot in the form of visual sentences. We demonstrate the performance of the system on 12 unlabeled action clips taken from collaborative cooking videos on YouTube. We view this as a first step towards extracting collaborative manipulation action sequences from unconstrained, unlabeled online videos.
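The abstract does not spell out the structure of a visual sentence command; as a minimal sketch (all field names and values here are illustrative assumptions, not taken from the paper), such a command might pair an acting agent with a manipulation verb, the manipulated object, and an optional second agent so that collaborative steps like a handover can be expressed explicitly:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VisualSentence:
    """One command extracted from a video clip (schema is hypothetical)."""
    subject: str               # acting agent, e.g. "human_1" or "robot"
    action: str                # detected manipulation verb, e.g. "hand_over"
    obj: str                   # manipulated object, e.g. "knife"
    recipient: Optional[str]   # second agent for collaborative actions, else None

# A hypothetical plan for a short collaborative cooking clip:
plan = [
    VisualSentence("human_1", "grasp", "knife", None),
    VisualSentence("human_1", "hand_over", "knife", "robot"),
    VisualSentence("robot", "cut", "tomato", None),
]

for step in plan:
    print(step)
```

Under this assumed representation, single-agent actions reduce to sentences with an empty recipient field, while collaborative actions name both participants, which is what distinguishes them from the single-robot plans targeted by prior work.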
