Watch-n-Patch: Unsupervised understanding of actions and relations

We focus on modeling human activities comprising multiple actions in a completely unsupervised setting. Our model learns the high-level co-occurrence and temporal relations between the actions in an activity video. We represent a video as a sequence of short-term action clips, called action-words, and an activity as a set of action-topics indicating which actions are present in the video. We then propose a new probabilistic model relating the action-words and the action-topics, which allows us to capture long-range action relations that commonly exist in complex activities but are difficult to model in previous work. We apply our model to unsupervised action segmentation and recognition, as well as to a novel application that detects forgotten actions, which we call action patching. For evaluation, we also contribute a new challenging RGB-D activity video dataset recorded with the Kinect v2, which contains several human daily activities composed of multiple actions involving interactions with different objects. Extensive experiments show the effectiveness of our model.
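To make the action-word/action-topic representation concrete, the following is a minimal, hypothetical sketch: each video is a sequence of discrete action-word indices, and latent action-topics are inferred with a collapsed Gibbs sampler for a plain LDA-style topic model. This is only an illustrative stand-in; the paper's actual model additionally encodes the temporal and relative-time structure between topics, which this sketch omits. All function and variable names here are invented for illustration.

```python
import numpy as np

def gibbs_action_topics(videos, n_topics, n_words,
                        alpha=0.5, beta=0.1, iters=100, seed=0):
    """Infer a latent action-topic for each action-word token.

    videos: list of videos, each a list of action-word indices.
    Returns per-token topic assignments and the count matrices.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(videos), n_topics))  # video-topic counts
    nkw = np.zeros((n_topics, n_words))      # topic-word counts
    nk = np.zeros(n_topics)                  # per-topic totals
    # Random initial topic assignment for every action-word token.
    z = [rng.integers(n_topics, size=len(v)) for v in videos]
    for d, video in enumerate(videos):
        for i, w in enumerate(video):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, video in enumerate(videos):
            for i, w in enumerate(video):
                k = z[d][i]
                # Remove this token's counts, then resample its topic.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw
```

In this view, unsupervised action segmentation amounts to reading off the per-token topic assignments `z`, and contiguous runs of the same topic form action segments.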
