Deep Action Parsing in Videos With Large-Scale Synthesized Data

Action parsing in videos with complex scenes is an interesting but challenging task in computer vision. In this paper, we propose a generic 3D convolutional neural network in a multi-task learning manner for effective Deep Action Parsing (DAP3D-Net) in videos. Particularly, in the training phase, action localization, classification, and attributes learning can be jointly optimized on our appearance-motion data via DAP3D-Net. For an upcoming test video, we can describe each individual action in the video simultaneously as: Where the action occurs, What the action is, and How the action is performed. To well demonstrate the effectiveness of the proposed DAP3D-Net, we also contribute a new Numerous-category Aligned Synthetic Action data set, i.e., NASA, which consists of 200 000 action clips of over 300 categories and with 33 pre-defined action attributes in two hierarchical levels (i.e., low-level attributes of basic body part movements and high-level attributes related to action motion). We learn DAP3D-Net using the NASA data set and then evaluate it on our collected Human Action Understanding data set and the public THUMOS data set. Experimental results show that our approach can accurately localize, categorize, and describe multiple actions in realistic videos.

[1]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[2]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[3]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[5]  M.H. Ang,et al.  Detection of activities for daily life surveillance: Eating and drinking , 2008, HealthCom 2008 - 10th International Conference on e-health Networking, Applications and Services.

[6]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Teddy Ko,et al.  A survey on behavior analysis in video surveillance for homeland security applications , 2008, 2008 37th IEEE Applied Imagery Pattern Recognition Workshop.

[8]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Tim J. Ellis,et al.  ViHASi: Virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods , 2008, ICDSC.

[10]  Ce Liu,et al.  Exploring new representations and applications for motion analysis , 2009 .

[11]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[13]  Zicheng Liu,et al.  Cross-dataset action detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[15]  P. Siva,et al.  Action Detection in Crowd , 2010, BMVC.

[16]  Li Wang,et al.  Human Action Recognition and Localization in Video Using Structured Learning of Local Space-Time Features , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[17]  Tao Xiang,et al.  Weakly Supervised Action Detection , 2011, BMVC.

[18]  Björn Ommer,et al.  Video parsing for abnormality detection , 2011, 2011 International Conference on Computer Vision.

[19]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[20]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[21]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[22]  Ying Wu,et al.  Discriminative Video Pattern Search for Efficient Action Detection , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[24]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[25]  Shaogang Gong,et al.  Attribute Learning for Understanding Unstructured Social Activity , 2012, ECCV.

[26]  Tal Hassner,et al.  The Action Similarity Labeling Challenge , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Xuelong Li,et al.  Detection of Sudden Pedestrian Crossings for Driving Assistance Systems , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[28]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[29]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[30]  Sanja Fidler,et al.  Bottom-Up Segmentation for Top-Down Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[32]  Chunheng Wang,et al.  Attribute Regularization Based Human Action Recognition , 2013, IEEE Transactions on Information Forensics and Security.

[33]  Chunheng Wang,et al.  Robust relative attributes for human action recognition , 2013, Pattern Analysis and Applications.

[34]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Ling Shao,et al.  Learning Discriminative Key Poses for Action Recognition , 2013, IEEE Transactions on Cybernetics.

[36]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Ming Yang,et al.  Regionlets for Generic Object Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[38]  Kristen Grauman,et al.  Zero-shot recognition with unreliable attributes , 2014, NIPS.

[39]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Junsong Yuan,et al.  Learning Actionlet Ensemble for 3D Human Action Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[43]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[44]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[45]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Ling Shao,et al.  Efficient Search and Localization of Human Actions in Video Databases , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[47]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Bingbing Ni,et al.  Pose Adaptive Motion Feature Pooling for Human Action Analysis , 2014, International Journal of Computer Vision.

[49]  Lorenzo Torresani,et al.  C3D: Generic Features for Video Analysis , 2014, ArXiv.

[50]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[51]  Cees Snoek,et al.  APT: Action localization proposals from dense trajectories , 2015, BMVC.

[52]  Gang Yu,et al.  Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Nicu Sebe,et al.  Learning Deep Representations of Appearance and Motion for Anomalous Event Detection , 2015, BMVC.

[56]  Dave Tahmoush Applying action attribute class validation to improve human activity recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[57]  Roberto Cipolla,et al.  DEEP-CARVING: Discovering visual attributes by carving deep neural nets , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Weizhi Nie,et al.  Human Action Recognition Bases on Local Action Attributes , 2015 .

[60]  Xiaogang Wang,et al.  Deeply learned attributes for crowded scene understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Ling Shao,et al.  Fast action retrieval from videos via feature disaggregation , 2017, Comput. Vis. Image Underst..

[62]  Ling Shao,et al.  DAP3D-Net: Where, what and how actions occur in videos? , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).