New Feature-level Video Classification via Temporal Attention Model

CoVieW 2018 is a new challenge that aims at simultaneous scene and action recognition in untrimmed videos [2]. In the challenge, frame-level video features extracted by a pre-trained deep convolutional neural network (CNN) are provided for video-level classification. In this paper, a new approach to video-level classification is proposed. The proposed method focuses on analysis in the temporal domain, for which a temporal attention model is developed. To compensate for the differing lengths of videos, a temporal padding method is also developed to unify video lengths. Further, data augmentation is performed to improve validation accuracy. On the train/validation split of the CoVieW 2018 dataset, the proposed method, combining the temporal attention model, nonzero padding, and data augmentation, achieves 95.53% accuracy on scenes and 87.17% accuracy on actions. Under the top-1 Hamming score, the standard metric of the CoVieW 2018 challenge, the proposed method obtains 91.35%.
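
The abstract describes two mechanisms without implementation detail: nonzero temporal padding that unifies variable-length sequences of frame features, and a temporal attention model that pools those features into a single video-level representation with two classification heads (scene and action). The following PyTorch sketch illustrates one plausible reading of those ideas; it is not the authors' code, and the names (pad_nonzero, TemporalAttentionClassifier), the unified length MAX_FRAMES, the feature dimension, and the class counts are all illustrative assumptions.

```python
# Minimal sketch of temporal attention over nonzero-padded frame features.
# All constants below are placeholders, not values taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_FRAMES = 300   # assumed unified sequence length
FEAT_DIM   = 1024  # assumed dimension of the provided frame-level CNN features

def pad_nonzero(features: torch.Tensor, max_len: int = MAX_FRAMES) -> torch.Tensor:
    """Unify sequence length by tiling the clip instead of appending zeros.

    `features` has shape (T, FEAT_DIM). Repeating real frames keeps every
    time step informative for the attention weights.
    """
    t = features.size(0)
    if t >= max_len:
        return features[:max_len]
    reps = (max_len + t - 1) // t            # ceil(max_len / t)
    return features.repeat(reps, 1)[:max_len]

class TemporalAttentionClassifier(nn.Module):
    """Attention-weighted temporal pooling followed by two classifier heads."""

    def __init__(self, feat_dim=FEAT_DIM, n_scenes=30, n_actions=300):
        # n_scenes / n_actions are placeholder class counts.
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.Tanh(),
            nn.Linear(256, 1),
        )
        self.scene_head = nn.Linear(feat_dim, n_scenes)
        self.action_head = nn.Linear(feat_dim, n_actions)

    def forward(self, x):                    # x: (B, MAX_FRAMES, FEAT_DIM)
        scores = self.attn(x)                # (B, MAX_FRAMES, 1)
        weights = F.softmax(scores, dim=1)   # attention over the time axis
        video_repr = (weights * x).sum(1)    # (B, FEAT_DIM) video-level feature
        return self.scene_head(video_repr), self.action_head(video_repr)

if __name__ == "__main__":
    feats = torch.randn(120, FEAT_DIM)        # a hypothetical 120-frame video
    clip = pad_nonzero(feats).unsqueeze(0)    # (1, MAX_FRAMES, FEAT_DIM)
    scene_logits, action_logits = TemporalAttentionClassifier()(clip)
```

Tiling real frames rather than appending zeros keeps the softmax from spending attention weight on empty time steps, which is one plausible motivation for the nonzero padding the abstract highlights.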

[1] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[2] Stephen Lin, et al. CoVieW'18: The 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, 2018, ACM Multimedia.

[3] Paolo Napoletano, et al. An interactive tool for manual, semi-automatic and automatic video annotation, 2015, Comput. Vis. Image Underst.

[4] Warren S. Sarle, et al. Stopped Training and Other Remedies for Overfitting, 1995.

[5] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Wisuwat Sunhem, et al. A comparison between shallow and deep architecture classifiers on small dataset, 2016, 8th International Conference on Information Technology and Electrical Engineering (ICITEE).

[7] Yann LeCun, et al. Convolutional Learning of Spatio-temporal Features, 2010, ECCV.