Converting video classification problem to image classification with global descriptors and pre-trained network

A motion history image (MHI) is a spatio-temporal template in which temporal motion information is collapsed into a single image whose intensity is a function of the recency of motion; it also retains spatial information. An energy image (EI), based on the magnitude of optical flow, is a temporal template that captures only the temporal information of motion. Each video can be described by these templates. Four new methods are introduced in this study, the first three of which are called basic methods. In method 1, each video is split into N groups of consecutive frames and an MHI is calculated for each group. These templates are then classified by transfer learning, fine-tuning a pre-trained network. Method 2 follows the same procedure as method 1 but classifies EIs instead of MHIs. Method 3 fuses the two streams of templates. Finally, method 4 adds spatial information. Method 4 outperforms the other three and is referred to as the proposed method. It achieves recognition accuracies of 92.30% and 94.50% on the UCF Sports and UCF-11 action data sets, respectively. The proposed method is also compared with state-of-the-art approaches, and the results show that it achieves the best performance among them.
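As an illustration of the MHI step in method 1, the following is a minimal sketch rather than the authors' implementation. It assumes the classic temporal-template update rule (a pixel is set to tau where motion is detected and otherwise decays by 1 toward 0), detects motion by thresholded frame differencing, and splits the video into N roughly equal groups. The function names, the `diff_threshold` value, and the choice of tau as the group length minus one are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def motion_history_image(frames, tau=None, diff_threshold=30):
    """Collapse a stack of grayscale frames (T, H, W) into one MHI.

    Assumed update rule: H(x, y) = tau where motion is detected,
    otherwise max(H(x, y) - 1, 0). Motion detection by thresholded
    frame differencing is an assumption of this sketch.
    """
    # Signed type so that frame differences do not wrap around.
    frames = np.asarray(frames, dtype=np.int16)
    if tau is None:
        tau = len(frames) - 1  # assumption: history spans the whole group
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    if tau <= 0:
        # A single frame carries no motion information.
        return mhi.astype(np.uint8)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > diff_threshold
        # Recent motion gets the maximum value; older motion fades out.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    # Scale to 0..255 so the template can be stored as an ordinary image.
    return (255.0 * mhi / tau).astype(np.uint8)


def mhi_per_group(frames, n_groups):
    """Split a video into n_groups runs of consecutive frames, one MHI each."""
    groups = np.array_split(np.asarray(frames), n_groups)
    return [motion_history_image(group) for group in groups]
```

Each returned template is a single-channel 8-bit image in which brighter pixels mark more recent motion. To pass it to a pre-trained image network it would typically be resized and replicated to three channels, which is again an assumption about the pipeline rather than a detail stated in the abstract.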
