Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding

For video understanding, that is, analyzing who did what in a video, actions and the objects involved are the primary elements. Most studies on actions have addressed recognition for well-trimmed videos and focused on improving classification performance. However, action detection, which includes localization as well as recognition, is required because actions generally overlap in time and space. In addition, most studies have not considered extensibility to a newly added action that has not been previously trained. Therefore, this paper proposes an extensible hierarchical method for detecting generic actions, which combine object movements and spatial relations between two objects, and inherited actions, which are determined from the related objects through an ontology- and rule-based methodology. The hierarchical design enables the method to detect any interactive action based on the spatial relations between two objects. Using object information, the method achieves an F-measure of 90.27%. Moreover, this paper demonstrates the extensibility of the method for a new action in a video whose domain differs from that of the dataset used.
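The core idea, a generic action inferred from the changing spatial relation between two tracked objects, then specialized into an inherited action via an ontology lookup on the object types, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; all function names, thresholds, relation labels, and ontology entries are assumptions made for the example.

```python
# Hypothetical sketch of hierarchical interactive-action detection:
# a generic action comes from how the spatial relation between two
# object tracks evolves over frames; an inherited action specializes
# it by the partner object's type. Names and thresholds are illustrative.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def spatial_relation(box_a, box_b, near_thresh=50.0):
    """Classify the relation between two boxes: 'overlap', 'near', or 'far'."""
    if iou(box_a, box_b) > 0:
        return "overlap"
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dist = ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5
    return "near" if dist < near_thresh else "far"

# Toy ontology: (generic action, partner object type) -> inherited action.
INHERITED = {
    ("approach", "car"): "approach_vehicle",
    ("touch", "bag"): "pick_up_object",
}

def detect_action(track_a, track_b, obj_type_b):
    """Infer a generic action from the change in spatial relation between
    the first and last frames, then specialize it through the ontology."""
    first = spatial_relation(track_a[0], track_b[0])
    last = spatial_relation(track_a[-1], track_b[-1])
    if first == "far" and last in ("near", "overlap"):
        generic = "approach"
    elif last == "overlap":
        generic = "touch"
    else:
        generic = "none"
    return INHERITED.get((generic, obj_type_b), generic)
```

Extensibility follows the same pattern: supporting a new interactive action amounts to adding a new (generic action, object type) rule rather than retraining a classifier.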
