Action Recognition and Detection by Combining Motion and Appearance Features

We present an action recognition and detection system from temporally untrimmed videos by combining motion and appearance features. Motion and appearance provides two complementary cues for human action understanding from videos. For motion features, we adopt the Fisher vector representation with improved dense trajectories due to its rich descriptive capacity. For appearance feature, we choose the deep convolutional neural network activations due to its recent success in image based tasks. With this fused feature of iDT and CNN, we train a SVM classifier for each action class in one-vs-all scheme. We report both the recognition and detection results of our system on THUMOS

[1]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[2]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[3]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[4]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[5]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[6]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[8]  Limin Wang,et al.  Computer Vision and Image Understanding Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice , 2022 .