Online Action Detection

In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 h of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.

[1]  Derek Hoiem,et al.  Diagnosing Error in Object Detectors , 2012, ECCV.

[2]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Cordelia Schmid,et al.  Human Focused Action Localization in Video , 2010, ECCV Workshops.

[4]  Xu Jia,et al.  Guiding the Long-Short Term Memory Model for Image Caption Generation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[6]  Sven J. Dickinson,et al.  Recognize Human Activities from Partially Observed Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Silvio Savarese,et al.  A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[8]  Yi Wang,et al.  Sequential Max-Margin Event Detectors , 2014, ECCV.

[9]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[11]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Yun Fu,et al.  A Discriminative Model with Multiple Temporal Scales for Action Prediction , 2014, ECCV.

[15]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[16]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Kiriakos N. Kutulakos Trends and Topics in Computer Vision , 2010, Lecture Notes in Computer Science.

[18]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Fernando De la Torre,et al.  Facing Imbalanced Data--Recommendations for the Use of Performance Metrics , 2013, 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction.

[20]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Zhe Wang,et al.  Exploring Fisher vector and deep networks for action spotting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[23]  Fernando De la Torre,et al.  Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Cordelia Schmid,et al.  Actom sequence models for efficient action detection , 2011, CVPR 2011.

[25]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[26]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[27]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[28]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[29]  Ramakant Nevatia,et al.  Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images , 2015, ACM Multimedia.

[30]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[31]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Li Fei-Fei,et al.  End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Zicheng Liu,et al.  Cross-dataset action detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[34]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Gang Yu,et al.  Predicting human activities using spatio-temporal structure of interest points , 2012, ACM Multimedia.

[36]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.