Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web

In order to advance action generation and creation in robots beyond simple learned schemas, we need computational tools that allow us to automatically interpret and represent human actions. This paper presents a system that learns manipulation action plans by processing unconstrained videos from the World Wide Web. Its goal is to robustly generate the sequence of atomic actions that make up longer manipulation activities observed in video, so that this knowledge can be acquired for robots. The lower level of the system consists of two convolutional neural network (CNN) based recognition modules, one for classifying the hand grasp type and one for object recognition. The higher level is a parsing module based on a probabilistic manipulation action grammar, which generates visual sentences for robot manipulation. Experiments on a publicly available unconstrained video dataset show that the system is able to learn manipulation actions by "watching" unconstrained videos with high accuracy.
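
To make the two-level design concrete, the following is a minimal sketch in Python of how the pieces could fit together, using NLTK's probabilistic grammar tools. This is our illustration, not the authors' implementation: the grammar rules, probabilities, nonterminal names, and terminal tokens below are all made up for demonstration. The idea it shows is that per-frame CNN outputs (grasp type, object class) act as terminal symbols of a probabilistic manipulation action grammar, and a Viterbi parser recovers the most likely parse tree, i.e. the "visual sentence" describing the action sequence.

import nltk

# Illustrative probabilistic manipulation action grammar (hypothetical
# rules and probabilities, not taken from the paper). AP is an action
# phrase, HP a hand phrase, H a grasp type, O an object.
grammar = nltk.PCFG.fromstring("""
AP -> HP AP [0.4] | HP [0.6]
HP -> H O [1.0]
H  -> 'PowerGrasp' [0.5] | 'PrecisionGrasp' [0.5]
O  -> 'Knife' [0.4] | 'Tomato' [0.6]
""")

# In the full system these terminal tokens would be produced by the two
# CNN modules (grasp-type classification and object recognition) run on
# the video; here they are hard-coded for demonstration.
tokens = ['PowerGrasp', 'Knife', 'PrecisionGrasp', 'Tomato']

# The Viterbi parser returns the most probable tree over the token
# sequence: the "visual sentence" for the observed manipulation.
for tree in nltk.ViterbiParser(grammar).parse(tokens):
    print(tree)

Under these assumptions, the printed tree groups each (grasp, object) pair into a hand phrase and chains the phrases into one action plan, which is the structure a robot executive could then map to atomic actions.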
