Watch-n-Patch: Unsupervised Learning of Actions and Relations

Humans perform a wide variety of activities in their everyday lives. We model these composite human activities, each of which comprises multiple basic-level actions, in a completely unsupervised setting. Our model learns high-level co-occurrence and temporal relations between the actions. We treat a video as a sequence of short-term action clips, each of which contains human-words and object-words. An activity is represented as a set of action-topics and object-topics, indicating which actions are present and which objects are being interacted with. We then propose a new probabilistic model relating the words and the topics. It allows us to model the long-range action relations that commonly exist in composite activities, which has been challenging for previous work. We apply our model to unsupervised action segmentation and clustering, and to a novel application that detects forgotten actions, which we call action patching. For evaluation, we contribute a new, challenging RGB-D activity video dataset recorded with the new Kinect v2, which contains several daily human activities composed of multiple actions involving different objects. Moreover, we develop a robotic system that watches people and reminds them of forgotten actions using our action patching algorithm. Our robotic setup can be easily deployed on any assistive robot.
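To make the idea of action patching concrete, the following is a minimal toy sketch, not the probabilistic model proposed in the paper. It assumes each video has already been reduced to a list of discrete action labels, and it replaces the learned topic relations with simple co-occurrence frequencies. The function names and the action labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(videos):
    """Count per-action and per-pair occurrences across videos.

    Each video is a list of action labels (a stand-in for learned action-topics).
    """
    pair_counts = Counter()
    action_counts = Counter()
    for actions in videos:
        present = set(actions)
        action_counts.update(present)
        for a, b in combinations(sorted(present), 2):
            pair_counts[(a, b)] += 1
    return pair_counts, action_counts


def forgotten_action_scores(observed, pair_counts, action_counts):
    """Score every unobserved action by how often it accompanies the observed ones.

    The score is the average, over observed actions a, of
    count(a and candidate in the same video) / count(a).
    """
    present = set(observed)
    scores = {}
    for candidate in action_counts:
        if candidate in present:
            continue
        total = 0.0
        for a in present:
            key = tuple(sorted((a, candidate)))
            if action_counts[a]:
                total += pair_counts[key] / action_counts[a]
        scores[candidate] = total / len(present) if present else 0.0
    return scores


def most_likely_forgotten(observed, videos):
    """Return the unobserved action with the highest co-occurrence score, or None."""
    pair_counts, action_counts = cooccurrence_counts(videos)
    scores = forgotten_action_scores(observed, pair_counts, action_counts)
    return max(scores, key=scores.get) if scores else None


# Hypothetical training videos, already segmented into action labels.
training_videos = [
    ["fetch-milk", "pour", "put-back-milk"],
    ["fetch-milk", "pour", "put-back-milk"],
    ["fetch-book", "read"],
]

# A new video in which one action appears to be missing.
print(most_likely_forgotten(["fetch-milk", "pour"], training_videos))
```

This sketch ignores temporal order and long-range relations, which the abstract identifies as central to the proposed model; it only shows how co-occurrence statistics can point to a missing action.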
