Learning to Track for Spatio-Temporal Action Localization

We propose an effective approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame-level and scores them with a combination of static and motion CNN features. It then tracks high-scoring proposals throughout the video using a tracking-by-detection approach. Our tracker relies simultaneously on instance-level and class-level detectors. The tracks are scored using a spatio-temporal motion histogram, a descriptor at the track level, in combination with the CNN features. Finally, we perform temporal localization of the action using a sliding-window approach at the track level. We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art with a margin of 15%, 7% and 12% respectively in mAP.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Thomas Brox,et al.  High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[3]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[4]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[7]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Cordelia Schmid,et al.  Human Focused Action Localization in Video , 2010, ECCV Workshops.

[9]  Luc Van Gool,et al.  On-line Adaption of Class-specific Codebooks for Instance Tracking , 2010, BMVC.

[10]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[11]  Jiri Matas,et al.  Face-TLD: Tracking-Learning-Detection applied to faces , 2010, 2010 IEEE International Conference on Image Processing.

[12]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[13]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[14]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[15]  Ivan Laptev,et al.  Learning person-object interactions for action recognition in still images , 2011, NIPS.

[16]  Junsong Yuan,et al.  Max-Margin Structured Output Regression for Spatio-Temporal Action Localization , 2012, NIPS.

[17]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[18]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[19]  Zdenek Kalal,et al.  Tracking-Learning-Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[21]  Cordelia Schmid,et al.  Explicit Modeling of Human-Object Interactions in Realistic Videos , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[24]  Cordelia Schmid,et al.  Temporal Localization of Actions with Actoms. , 2013, IEEE transactions on pattern analysis and machine intelligence.

[25]  Luca Maria Gambardella,et al.  Fast image scanning with deep max-pooling convolutional neural networks , 2013, 2013 IEEE International Conference on Image Processing.

[26]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Nazli Ikizler-Cinbis,et al.  Action Recognition and Localization by Hierarchical Space-Time Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[29]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Cordelia Schmid,et al.  Efficient Action Localization with Approximately Normalized Fisher Vectors , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[34]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[35]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[37]  Lorenzo Torresani,et al.  C3D: Generic Features for Video Analysis , 2014, ArXiv.

[38]  Cordelia Schmid,et al.  Occlusion and Motion Reasoning for Long-Term Tracking , 2014, ECCV.

[39]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[40]  Gang Yu,et al.  Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Cordelia Schmid,et al.  A Robust and Efficient Video Representation for Action Recognition , 2015, International Journal of Computer Vision.

[43]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Vibhav Vineet,et al.  Struck: Structured Output Tracking with Kernels , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Bernt Schiele,et al.  What Makes for Effective Detection Proposals? , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Aaas News,et al.  Book Reviews , 1893, Buffalo Medical and Surgical Journal.