Watch-Bot: Unsupervised learning for reminding humans of forgotten actions

We present a robotic system that watches a human with a Kinect v2 RGB-D sensor, detects what the person forgot to do while performing an activity, and, if necessary, reminds them by pointing out the related object with a laser pointer. Our simple setup can be easily deployed on any assistive robot. Our approach is based on a learning algorithm trained in a purely unsupervised setting, requiring no human annotations. This makes our approach scalable and applicable to a variety of scenarios. Our model learns action/object co-occurrences and temporal relations between actions within an activity, and uses these learned relationships to infer the forgotten action and the related object. We show that our approach not only improves unsupervised action segmentation and action cluster assignment performance, but also effectively detects forgotten actions on a challenging human activity RGB-D video dataset. In robotic experiments, we show that our robot is able to successfully remind people of forgotten actions.
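The core idea of the abstract — learning which actions co-occur within an activity and which actions typically follow one another, then flagging an action that is usually present but missing — can be illustrated with a minimal sketch. This is not the paper's actual model (which is trained unsupervised on RGB-D video); it is a toy count-based version over hypothetical, already-segmented action labels, with made-up action names and a `min_support` threshold chosen for illustration.

```python
from collections import Counter, defaultdict

def train(sequences):
    """Learn simple occurrence and temporal (bigram) statistics from
    training activity sequences. Stands in for the paper's learned
    action co-occurrence and temporal relations."""
    follows = defaultdict(Counter)  # follows[a][b]: how often b follows a
    occurs = Counter()              # in how many activities each action appears
    for seq in sequences:
        occurs.update(set(seq))
        for a, b in zip(seq, seq[1:]):
            follows[a][b] += 1
    return occurs, follows

def forgotten_actions(occurs, seq, n_train, min_support=0.8):
    """Flag actions present in most training activities but absent here."""
    return [a for a, c in occurs.items()
            if c / n_train >= min_support and a not in seq]

# Hypothetical training activities (action names are invented).
train_seqs = [
    ["fetch-milk", "pour-milk", "put-back-milk"],
    ["fetch-milk", "pour-milk", "put-back-milk"],
    ["fetch-milk", "pour-milk", "put-back-milk"],
]
occurs, follows = train(train_seqs)

# An observed activity where the person stopped early.
observed = ["fetch-milk", "pour-milk"]
print(forgotten_actions(occurs, observed, len(train_seqs)))
# prints ['put-back-milk']

# The temporal statistics also suggest what usually comes next,
# which is what the robot would point out.
print(follows[observed[-1]].most_common(1))
# prints [('put-back-milk', 3)]
```

A real system would replace the hard-coded sequences with action clusters segmented from video, and the threshold with probabilities from the learned model; the detection logic, however, follows the same shape.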
