Paper Information - Watch-n-Patch: Unsupervised Learning of Actions and Relations

There is a large variation in the activities that humans perform in their everyday lives. We consider modeling these composite human activities, which comprise multiple basic-level actions, in a completely unsupervised setting. Our model learns high-level co-occurrence and temporal relations between the actions. We treat a video as a sequence of short-term action clips, each of which contains human-words and object-words. An activity is described by a set of action-topics and object-topics, indicating which actions are present and which objects are being interacted with. We then propose a new probabilistic model relating the words and the topics. It allows us to model the long-range action relations that commonly exist in composite activities, which has been challenging for previous work. We apply our model to unsupervised action segmentation and clustering, and to a novel application that detects forgotten actions, which we call action patching. For evaluation, we contribute a new, challenging RGB-D activity video dataset recorded with the new Kinect v2, which contains several human daily activities as compositions of multiple actions interacting with different objects. Moreover, we develop a robotic system that watches and reminds people using our action patching algorithm. Our robotic setup can be easily deployed on any assistive robot.
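
To make the idea of action patching more concrete, the following is a minimal toy sketch. It is not the probabilistic topic model described in the abstract: it assumes each video has already been reduced to a list of action labels, and it swaps in plain co-occurrence counting to guess which action might have been forgotten. The function names and the action labels (`fetch-milk`, `pour-milk`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(videos):
    """Count how often each unordered pair of action labels shares a video."""
    pair_counts = Counter()
    for actions in videos:
        # Use the set of labels so a repeated action is counted once per video.
        for a, b in combinations(sorted(set(actions)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def suggest_forgotten_action(observed, videos):
    """Return the unobserved action that co-occurs most often with the
    observed ones, or None if no candidate exists.

    Toy stand-in for action patching (assumption, not the paper's method).
    """
    pair_counts = cooccurrence_counts(videos)
    present = set(observed)
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in present and b not in present:
            scores[b] += n
        elif b in present and a not in present:
            scores[a] += n
    if not scores:
        return None
    # Iterate in alphabetical order so ties are broken deterministically.
    return max(sorted(scores), key=scores.get)


# Hypothetical training videos, each already segmented into action labels.
training = [
    ["fetch-milk", "pour-milk", "put-back-milk"],
    ["fetch-milk", "pour-milk", "put-back-milk"],
    ["fetch-book", "read-book"],
]

print(suggest_forgotten_action(["fetch-milk", "pour-milk"], training))
```

In this toy setting, a video in which the milk is fetched and poured but never returned leads to `put-back-milk` being suggested. The sketch only uses co-occurrence; the temporal relations and the word-to-topic inference mentioned in the abstract are left out.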
