Anticipating Human Activities Using Object Affordances for Reactive Robotic Response

An important aspect of human perception is anticipation, which we use extensively in our day-to-day activities when interacting with other humans as well as with our surroundings. Anticipating which activities will a human do next (and how) can enable an assistive robot to plan ahead for reactive responses. Furthermore, anticipation can even improve the detection accuracy of past activities. The challenge, however, is two-fold: We need to capture the rich context for modeling the activities and object affordances, and we need to anticipate the distribution over a large space of future human activities. In this work, we represent each possible future using an anticipatory temporal conditional random field (ATCRF) that models the rich spatial-temporal relations through object affordances. We then consider each ATCRF as a particle and represent the distribution over the potential futures using a set of particles. In extensive evaluation on CAD-120 human activity RGB-D dataset, we first show that anticipation improves the state-of-the-art detection results. We then show that for new subjects (not seen in the training set), we obtain an activity anticipation accuracy (defined as whether one of top three predictions actually happened) of 84.1, 74.4 and 62.2 percent for an anticipation time of 1, 3 and 10 seconds respectively. Finally, we also show a robot using our algorithm for performing a few reactive responses.

[1]  Cordelia Schmid,et al.  Actom sequence models for efficient action detection , 2011, CVPR 2011.

[2]  Yun Jiang,et al.  Learning Object Arrangements in 3D Scenes using Human Context , 2012, ICML.

[3]  Amit K. Roy-Chowdhury,et al.  Context-Aware Modeling and Recognition of Activities in Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[5]  Alexei A. Efros,et al.  From 3D scene geometry to human workspace , 2011, CVPR 2011.

[6]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Andrew Blake,et al.  Efficient Human Pose Estimation from Single Depth Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Ramakant Nevatia,et al.  Coupled Hidden Semi Markov Models for Activity Recognition , 2007, 2007 IEEE Workshop on Motion and Video Computing (WMVC'07).

[9]  Shaogang Gong,et al.  Attribute Learning for Understanding Unstructured Social Activity , 2012, ECCV.

[10]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[11]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[12]  Philip S. Yu,et al.  Mining Sequence Classifiers for Early Prediction , 2008, SDM.

[13]  J. Faraway,et al.  Modelling three‐dimensional trajectories by using Bézier curves with application to hand motion , 2007 .

[14]  Luc Van Gool,et al.  What makes a chair a chair? , 2011, CVPR 2011.

[15]  Nicholas Roy,et al.  Probabilistic Modeling of Human Movements for Intention Inference , 2013 .

[16]  Siddhartha S. Srinivasa,et al.  Formalizing Assistive Teleoperation , 2012, Robotics: Science and Systems.

[17]  Siddhartha S. Srinivasa,et al.  Planning-based prediction for pedestrians , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[18]  Cordelia Schmid,et al.  Weakly Supervised Learning of Interactions between Humans and Objects , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Fredrik Gustafsson,et al.  Particle filters for positioning, navigation, and tracking , 2002, IEEE Trans. Signal Process..

[21]  James M. Rehg,et al.  Learning and Inferring Motion Patterns using Parametric Segmental Switching Linear Dynamic Systems , 2008, International Journal of Computer Vision.

[22]  Stefanos Nikolaidis,et al.  Human-robot cross-training: Computational formulation, modeling and evaluation of a human team training strategy , 2013, 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[23]  Frank Dellaert,et al.  MCMC-based particle filtering for tracking a variable number of interacting targets , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  J. Gibson The Ecological Approach to Visual Perception , 1979 .

[26]  Martial Hebert,et al.  Modeling the Temporal Extent of Actions , 2010, ECCV.

[27]  Hema Swetha Koppula,et al.  URL normalization for de-duplication of web pages , 2009, CIKM.

[28]  Wolfram Burgard,et al.  Feature-Based Prediction of Trajectories for Socially Compliant Navigation , 2012, Robotics: Science and Systems.

[29]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[30]  Yun Jiang,et al.  Infinite Latent Conditional Random Fields for Modeling Environments through Humans , 2013, Robotics: Science and Systems.

[31]  James M. Rehg,et al.  Affordance Prediction via Learned Object Attributes , 2011 .

[32]  Tsuhan Chen,et al.  3D-Based Reasoning with Blocks, Support, and Stability , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  E. Guizzo,et al.  The rise of the robot worker , 2012, IEEE Spectrum.

[34]  William Whittaker,et al.  Conditional particle filters for simultaneous mobile robot localization and people-tracking , 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292).

[35]  Sebastian Thrun,et al.  FastSLAM: a factored solution to the simultaneous localization and mapping problem , 2002, AAAI/IAAI.

[36]  Patric Jensfelt,et al.  Feature based CONDENSATION for mobile robot localization , 2000, Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065).

[37]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Fernando De la Torre,et al.  Action unit detection with segment-based SVMs , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  Cristian Sminchisescu,et al.  Probabilistic Joint Image Segmentation and Labeling , 2011, NIPS.

[40]  Li Wang,et al.  Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models , 2011, International Journal of Computer Vision.

[41]  Yun Jiang,et al.  Learning to place new objects in a scene , 2012, Int. J. Robotics Res..

[42]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[43]  Takeo Kanade,et al.  Automated Construction of Robotic Manipulation Programs , 2010 .

[44]  Carsten Rother,et al.  Weakly supervised discriminative localization and classification: a joint learning process , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[45]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[46]  Pierre Hansen,et al.  Roof duality, complementation and persistency in quadratic 0–1 optimization , 1984, Math. Program..

[47]  Zhao Gang,et al.  Dynamic Probabilistic Network Based Human Action Recognition , 2016, ArXiv.

[48]  Dieter Fox,et al.  KLD-Sampling: Adaptive Particle Filters , 2001, NIPS.

[49]  Bernt Schiele,et al.  Script Data for Attribute-Based Recognition of Composite Activities , 2012, ECCV.

[50]  Fernando De la Torre,et al.  Maximum Margin Temporal Clustering , 2012, AISTATS.

[51]  James M. Rehg,et al.  Learning Visual Object Categories for Robot Affordance Prediction , 2010, Int. J. Robotics Res..

[52]  Hema Swetha Koppula,et al.  Learning Spatio-Temporal Structure from RGB-D Videos for Human Activity Detection and Anticipation , 2013, ICML.

[53]  Bingbing Ni,et al.  RGBD-HuDaAct: A color-depth video database for human daily activity recognition , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[54]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[55]  David J. Kriegman,et al.  Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Ramakant Nevatia,et al.  Large-scale event detection using semi-hidden Markov models , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[57]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[58]  Kevin P. Murphy,et al.  Modeling changing dependency structure in multivariate time series , 2007, ICML '07.

[59]  R. Vidal,et al.  Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[61]  Clayton D. Scott,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[63]  Zhenhua Wang,et al.  Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  智一 吉田,et al.  Efficient Graph-Based Image Segmentationを用いた圃場図自動作成手法の検討 , 2014 .

[65]  Jean Ponce,et al.  Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[66]  Fernando De la Torre,et al.  Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[67]  Alexei A. Efros,et al.  Scene Semantics from Long-Term Observation of People , 2012, ECCV.

[68]  Nando de Freitas,et al.  Rao-Blackwellised Particle Filtering for Dynamic Bayesian Networks , 2000, UAI.

[69]  Ashutosh Saxena,et al.  Co-evolutionary predictors for kinematic pose inference from RGBD images , 2012, GECCO '12.

[70]  D. Fox,et al.  People Tracking with Anonymous and ID-Sensors Using Rao-Blackwellised Particle Filters , 2003, IJCAI.

[71]  Fernando De la Torre,et al.  Joint segmentation and classification of human actions in video , 2011, CVPR 2011.

[72]  Yun Jiang,et al.  Hallucinated Humans as the Hidden Context for Labeling 3D Scenes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[73]  Bernhard Schölkopf,et al.  Probabilistic Modeling of Human Movements for Intention Inference , 2012, Robotics: Science and Systems.

[74]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[75]  Trevor Darrell,et al.  Hidden Conditional Random Fields , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[76]  Thorsten Joachims,et al.  Contextually Guided Semantic Labeling and Search for 3D Point Clouds , 2011, ArXiv.

[77]  Bart Selman,et al.  Human Activity Detection from RGBD Images , 2011, Plan, Activity, and Intent Recognition.

[78]  Bart Selman,et al.  Unstructured human activity detection from RGBD images , 2011, 2012 IEEE International Conference on Robotics and Automation.

[79]  Yang Wang,et al.  Recognizing human actions from still images with latent poses , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[80]  Shaogang Gong,et al.  Recognition of group activities using dynamic probabilistic networks , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[81]  Krishna P. Gummadi,et al.  Growth of the flickr social network , 2008, WOSN '08.

[82]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[83]  Lynne E. Parker,et al.  4-dimensional local spatio-temporal features for human activity recognition , 2011, 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[84]  Trevor Darrell,et al.  Hidden-state Conditional Random Fields , 2006 .

[85]  Hema Swetha Koppula,et al.  Learning human activities and object affordances from RGB-D videos , 2012, Int. J. Robotics Res..

[86]  Danica Kragic,et al.  Visual object-action recognition: Inferring object affordances from human demonstration , 2011, Comput. Vis. Image Underst..

[87]  Zaïd Harchaoui,et al.  Kernel Change-point Analysis , 2008, NIPS.

[88]  Alan Fern,et al.  Discriminatively trained particle filters for complex multi-object tracking , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[89]  Larry S. Davis,et al.  Event Modeling and Recognition Using Markov Logic Networks , 2008, ECCV.

[90]  Ivan Laptev,et al.  Learning person-object interactions for action recognition in still images , 2011, NIPS.

[91]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[92]  Vladimir Kolmogorov,et al.  Optimizing Binary MRFs via Extended Roof Duality , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[93]  Sung-Bae Cho,et al.  Activity recognition based on wearable sensors using selection/fusion hybrid ensemble , 2011, 2011 IEEE International Conference on Systems, Man, and Cybernetics.

[94]  Michael I. Jordan,et al.  Nonparametric Bayesian Learning of Switching Linear Dynamical Systems , 2008, NIPS.