Learning human activities and object affordances from RGB-D videos

Understanding human activities and object affordances are two important capabilities, especially for personal robots operating in human environments. In this work, we consider the problem of extracting a descriptive labeling of the sequence of sub-activities being performed by a human and, more importantly, of their interactions with objects in the form of associated affordances. Given an RGB-D video, we jointly model the human activities and object affordances as a Markov random field, where the nodes represent objects and sub-activities, and the edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time. We formulate the learning problem using a structural support vector machine (SSVM) approach, where labelings over various alternative temporal segmentations are treated as latent variables. We tested our method on a challenging dataset comprising 120 activity videos collected from 4 subjects, and obtained accuracies of 79.4% for affordance labeling, 63.4% for sub-activity labeling, and 75.0% for high-level activity labeling. We then demonstrate the use of such descriptive labeling by a PR2 robot performing assistive tasks.
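To make the model concrete, the sketch below illustrates the kind of scoring the abstract describes: each temporal segment carries a sub-activity node, each object carries an affordance node, edges couple affordances to the co-occurring sub-activity, and temporal edges couple adjacent segments; the segmentation itself is latent, so inference picks the best-scoring candidate. All label names, weight tables, and function names here are hypothetical illustrations, not the paper's actual feature maps or label sets.

```python
# Minimal sketch of MRF-style scoring over a temporal segmentation.
# Label sets and weights are illustrative placeholders, not the paper's.
SUB_ACTIVITIES = ["reaching", "moving", "placing"]
AFFORDANCES = ["reachable", "movable", "placeable"]


def score_segment(sub_act, affordances, node_w, edge_w):
    """Score one temporal segment: a node potential for the sub-activity,
    a node potential for each object's affordance, and a pairwise
    object-sub-activity interaction term."""
    s = node_w.get(("act", sub_act), 0.0)
    for aff in affordances:
        s += node_w.get(("aff", aff), 0.0)
        s += edge_w.get((sub_act, aff), 0.0)  # affordance / sub-activity edge
    return s


def score_labeling(segments, trans_w, node_w, edge_w):
    """Score a full labeling of one candidate segmentation, adding
    temporal-transition potentials between adjacent sub-activities."""
    total, prev_act = 0.0, None
    for sub_act, affs in segments:
        total += score_segment(sub_act, affs, node_w, edge_w)
        if prev_act is not None:
            total += trans_w.get((prev_act, sub_act), 0.0)
        prev_act = sub_act
    return total


def best_over_segmentations(candidates, trans_w, node_w, edge_w):
    """Treat the segmentation as a latent variable: among candidate
    segmentations (each with a labeling), keep the highest-scoring one."""
    return max(candidates,
               key=lambda segs: score_labeling(segs, trans_w, node_w, edge_w))
```

In the actual SSVM formulation, the weights would be learned from labeled videos, and the potentials would come from features of the RGB-D data rather than lookup tables; this sketch only shows the structure of the joint objective.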
