Long-Term Activity Forecasting Using First-Person Vision

Long-term activity forecasting addresses the problem of predicting how an agent will complete a full activity, defined as a continuous trajectory together with a discrete sequence of sub-actions. Whereas previous data-driven methods forecast only 2D trajectories, we present a method that leverages common-sense prior knowledge and minimal data. To forecast trajectories, we learn a policy function that maps states to the actions the agent should perform next. Through deep reinforcement learning, our method learns a highly non-linear mapping from agent states to actions. We develop the first forecasting framework that uses egocentric video input, an optimal vantage point for understanding human activities over large spaces. Given an annotated first-person video sequence of an activity, we construct a 3D point cloud of the environment and the activity paths through 3D space. From a limited number of examples, we use reinforcement learning to derive a policy for the entire environment, including areas never visited in the demonstrated examples. We explore deep reinforcement learning to recover a direct mapping from environmental features to the best action. Our approach combines a high-dimensional continuous state (the local point-cloud density surrounding the agent) with a discrete state component (the action stage of an activity) into a single state for behavior forecasting. The result is a policy that generalizes well from only a few activity samples. We validate our approach on our First-Person Office Behavior Dataset and show that encoding more prior knowledge increases forecasting accuracy. We also demonstrate that the deep reinforcement learning approach achieves higher forecasting accuracy than traditional alternatives.
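
To make the combined state representation concrete, the sketch below shows one plausible way to fuse the continuous point-cloud density features with a one-hot encoding of the discrete action stage and score candidate actions with a Q-network. This is an illustrative sketch, not the paper's implementation: the class name (PolicyQNetwork), the dimensions (DENSITY_DIM, NUM_STAGES, NUM_ACTIONS), and the layer sizes are all assumptions, since the abstract does not specify the architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # All sizes below are hypothetical; the paper's actual dimensions are
    # not given in the abstract.
    DENSITY_DIM = 64   # continuous local point-cloud density feature
    NUM_STAGES = 4     # discrete action stages of the activity
    NUM_ACTIONS = 9    # candidate motion actions (e.g. 8 headings + stay)

    class PolicyQNetwork(nn.Module):
        """Q-network over a single state fusing continuous and discrete parts."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(DENSITY_DIM + NUM_STAGES, 128),
                nn.ReLU(),
                nn.Linear(128, 128),
                nn.ReLU(),
                nn.Linear(128, NUM_ACTIONS),  # one Q-value per candidate action
            )

        def forward(self, density_feat, stage_idx):
            # One-hot encode the discrete action stage and concatenate it with
            # the continuous point-cloud density features into a single state.
            stage_onehot = F.one_hot(stage_idx, NUM_STAGES).float()
            state = torch.cat([density_feat, stage_onehot], dim=-1)
            return self.net(state)

    # Greedy policy: at each step, take the action with the highest Q-value.
    q_net = PolicyQNetwork()
    density = torch.rand(1, DENSITY_DIM)  # stand-in for real scene features
    stage = torch.tensor([1])             # agent currently in activity stage 1
    action = q_net(density, stage).argmax(dim=-1)

Training such a network with fitted Q-iteration or an experience-replay loop over the demonstrated trajectories would then yield the forecasting policy; that training choice is likewise an assumption here, not a detail stated in the abstract.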
