Filtration network: A frame sampling strategy via deep reinforcement learning for video captioning

Recently, many video captioning methods have adopted the encoder-decoder framework to translate short videos into natural language. These methods usually sample frames at equal intervals. However, this sampling strategy is inefficient: it retains high temporal and spatial redundancy, resulting in unnecessary computation cost. In addition, existing approaches simply concatenate different visual features at a fully connected layer, so the features are not utilized effectively. To address these shortcomings, we propose a filtration network (FN) that selects key frames and is trained with a deep reinforcement learning algorithm, actor-double-critic. Inspired by behavioral psychology, the core idea of actor-double-critic is that an agent's behavior is determined by both the external environment and its internal personality. Because it provides steady feedback after each action, it avoids the unclear rewards and sparse feedback that hinder training. The selected key frames are fed into a combine codec network (CCN) to generate sentences. The feature-combination operation in CCN fuses visual features through a complex-number representation to enable better semantic modeling. Experiments and comparisons with other methods on two datasets (MSVD and MSR-VTT) show that our approach achieves better performance on four metrics: BLEU-4, METEOR, ROUGE-L, and CIDEr.
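To make the actor-double-critic idea concrete, the sketch below shows one plausible reading of the training setup, not the authors' implementation: per-frame CNN features serve as states, the actor samples a binary keep/skip action for each frame, and two critics supply the value estimate, one tied to the external environment (the caption-quality reward) and one modeling the internal "personality" that gives dense feedback after every action. All network shapes and the equal blending of the two critics are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of an actor-double-critic
# frame selector. Assumptions: states are per-frame CNN features, the
# action space is {skip, keep}, and the two critics are averaged.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))              # logits for {skip, keep}

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))              # scalar state value

    def forward(self, state):
        return self.net(state).squeeze(-1)

actor = Actor()
external_critic = Critic()  # value of the environment reward (e.g. CIDEr)
internal_critic = Critic()  # "personality": dense feedback after each action

def select_frames(frame_feats):
    """Sample a keep/skip decision for every frame feature in the video."""
    dist = actor(frame_feats)                  # (T, 2) categorical
    actions = dist.sample()                    # 1 = keep the frame
    log_probs = dist.log_prob(actions)
    # Blending the two critics gives the agent steady per-step feedback,
    # avoiding the sparse terminal reward of a single environment critic.
    values = 0.5 * (external_critic(frame_feats)
                    + internal_critic(frame_feats))
    return actions, log_probs, values
```

The blended value can then drive a standard policy-gradient update on `log_probs`; the paper's exact objective is not reproduced here.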
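Similarly, the complex-number fusion in CCN can be sketched as follows. This is our own illustrative reading under the assumption that two visual streams (e.g., 2D appearance and 3D motion features) are treated as the real and imaginary parts of one complex tensor; the magnitude/phase readout is a hypothetical design choice, not the paper's specified operation.

```python
# A minimal sketch of fusing two visual features via a complex-number
# representation; the readout below is an illustrative assumption.
import torch

def complex_fuse(appearance, motion):
    """Treat one feature as the real part and the other as the imaginary
    part, then read the fused vector off the complex representation."""
    z = torch.complex(appearance, motion)       # (B, D) complex tensor
    # Magnitude couples the two modalities multiplicatively; phase keeps
    # their relative contribution, giving a richer joint encoding than
    # plain concatenation at a fully connected layer.
    return torch.cat([z.abs(), z.angle()], dim=-1)   # (B, 2D)

appearance = torch.randn(4, 2048)   # e.g. 2D CNN appearance features
motion = torch.randn(4, 2048)       # e.g. 3D CNN motion features
fused = complex_fuse(appearance, motion)             # shape (4, 4096)
```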
