Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me?

In this paper, we present a core technology to enable robots to recognize human activities during human-robot interactions. In particular, we propose a methodology for early recognition of activities from robot-centric videos (i.e., first-person videos) obtained from a robot's viewpoint during its interactions with humans. Early recognition, also known as activity prediction, is the ability to infer an ongoing activity at its early stage. We present an algorithm to recognize human activities targeting the camera from streaming videos, enabling the robot to predict the intended activities of the interacting person as early as possible and to react quickly (e.g., avoiding harmful events targeting itself before they actually occur). We introduce the novel concept of 'onset', which efficiently summarizes pre-activity observations, and design a recognition approach that considers event history in addition to visual features from first-person videos. We propose to represent an onset using a cascade histogram of time-series gradients, and we describe a novel algorithmic setup that takes advantage of such onsets for early recognition of activities. The experimental results clearly illustrate that the proposed concept of onset enables better and earlier recognition of human activities from first-person videos collected by a robot.

Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding – video analysis; I.4.8 [Image Processing and Computer Vision]: Scene Analysis – motion; I.2.9 [Artificial Intelligence]: Robotics – sensors
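The abstract names a "cascade histogram of time-series gradients" as the onset representation but does not spell out its construction, and the paper's exact formulation is not reproduced in this excerpt. Below is a minimal, hypothetical Python sketch of one plausible reading: take a per-frame feature time series, compute its temporal gradients, and histogram them over a cascade of temporal scales (a temporal pyramid). The function name, level count, bin count, and value range are all illustrative assumptions, not the authors' parameters.

```python
import numpy as np

def cascade_gradient_histogram(series, levels=3, bins=8, value_range=(-1.0, 1.0)):
    """Hypothetical sketch of a cascade histogram of time-series gradients.

    series: 1-D per-frame feature values (e.g., global motion magnitude).
    levels, bins, and value_range are illustrative choices, not the paper's.
    """
    grads = np.gradient(np.asarray(series, dtype=float))  # temporal gradients
    feats = []
    for level in range(levels):
        # Level k splits the observation window into 2**k equal temporal segments.
        for segment in np.array_split(grads, 2 ** level):
            hist, _ = np.histogram(segment, bins=bins, range=value_range)
            total = hist.sum()
            feats.append(hist / total if total > 0 else hist.astype(float))
    # Concatenate the per-segment histograms into one fixed-length descriptor.
    return np.concatenate(feats)

# Usage: summarize a synthetic pre-activity ("onset") signal that ramps up
# shortly before the activity itself begins.
t = np.linspace(0.0, 1.0, 120)
onset_signal = np.tanh(6.0 * (t - 0.7))  # slow rise, then a sharp change
descriptor = cascade_gradient_histogram(onset_signal)
print(descriptor.shape)                  # (8 * (1 + 2 + 4),) -> (56,)
```

A fixed-length descriptor of this kind could then be combined with ongoing visual features and fed to a standard classifier for early recognition; the paper's actual algorithmic setup is not detailed in this excerpt.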
