Action Video Recognition Framework based on NetVLAD with Data Augmentation

In this paper, we propose an end-to-end deep learning framework that extracts global spatio-temporal features from videos, based on the NetVLAD framework and optimized with data augmentation. Given keyframes extracted from the original videos, we present a three-step action recognition framework. The first step applies data augmentation based on central cropping, random cropping, and keyframe scaling. The second step extracts a local feature descriptor for each frame with the Two-Stream Inflated 3D ConvNet (I3D), which is based on inflating 2D ConvNets and enables the network to capture spatial and temporal features simultaneously. The third step aggregates these local features into a global representation for action recognition through the generalized "Vector of Locally Aggregated Descriptors" (NetVLAD) layer, optimized with a novel pooling strategy that avoids the misjudgments caused by relying on local features alone. The whole framework is trained and fine-tuned end to end. Experiments demonstrate that our framework outperforms state-of-the-art algorithms on the UCF101 dataset. The competitive results show efficient, high-accuracy action recognition (up to 91.25%) with fast inference (close to 1.8 s), which will significantly improve the performance and applicability of video action recognition.
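To make the aggregation step concrete, the following is a minimal NumPy sketch of NetVLAD-style pooling: local descriptors (e.g. per-frame I3D features) are softly assigned to cluster centers, their residuals are summed per cluster, and the result is normalized into one global vector. The function name, the `alpha` sharpness parameter, and the use of NumPy rather than a deep learning framework are illustrative assumptions; in the actual framework the centers and assignment weights are learnable layer parameters trained end to end.

```python
import numpy as np

def netvlad_aggregate(descriptors, centers, alpha=10.0):
    """NetVLAD-style pooling sketch (illustrative, not the trained layer).

    descriptors: (N, D) array of local features, e.g. per-frame I3D outputs
    centers:     (K, D) array of cluster centers (learnable in the real layer)
    alpha:       sharpness of the soft assignment (assumed hyperparameter)
    Returns a flattened, L2-normalized (K*D,) global descriptor.
    """
    # Soft assignment: softmax over negative squared distances to the centers.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # (N, K) assignment weights

    # Aggregate weighted residuals per cluster: V[k] = sum_i a[i,k] * (x_i - c_k)
    residuals = descriptors[:, None, :] - centers[None, :, :]  # (N, K, D)
    V = (a[:, :, None] * residuals).sum(axis=0)                # (K, D)

    # Intra-normalize each cluster row, flatten, then L2-normalize globally.
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)
```

Because the whole descriptor is a differentiable function of the inputs and centers, the same computation can be implemented as a trainable layer and fine-tuned jointly with the I3D backbone.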