Adaptive Spatial Location With Balanced Loss for Video Captioning

Many pioneering approaches have verified the effectiveness of exploiting global temporal and local object information for video understanding tasks and have achieved significant progress. However, existing methods apply object detectors to extract all objects over all video frames, which may degrade performance due to both spatial and temporal information redundancy. To address this problem, we propose an adaptive spatial location module for the video captioning task that dynamically predicts an important spatial position in each video frame while generating the description sentence. The proposed adaptive spatial location method not only makes our model focus on local object information, but also reduces the time and memory consumption caused by temporal redundancy across numerous video frames, and improves the accuracy of the generated descriptions. In addition, we propose a balanced loss function to address the class imbalance problem in the training data. The proposed balanced loss assigns a different weight to each word of the ground-truth sentence during training, which yields more diverse description sentences. Extensive experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared to state-of-the-art methods.
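To make the two contributions concrete, the following is a minimal PyTorch-style sketch of one way a word-weighted ("balanced") cross-entropy could be realized, assuming an inverse-frequency weighting scheme; the function name, the `word_freq` input, and the `beta` exponent are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def balanced_caption_loss(logits, targets, word_freq, pad_idx=0, beta=0.5):
    """Word-weighted cross-entropy: rarer ground-truth words get larger weights.

    logits:    (batch, seq_len, vocab_size) decoder outputs
    targets:   (batch, seq_len) ground-truth word indices
    word_freq: (vocab_size,) corpus frequency of each vocabulary word
    beta:      strength of the re-weighting (0 recovers ordinary cross-entropy)
    """
    # Hypothetical inverse-frequency weights, normalized to keep the loss scale stable.
    weights = word_freq.float().clamp(min=1.0) ** (-beta)
    weights = weights / weights.mean()

    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    mask = (targets.reshape(-1) != pad_idx).float()      # ignore padding positions
    per_token = per_token * weights[targets.reshape(-1)] * mask
    return per_token.sum() / mask.sum().clamp(min=1.0)
```

Similarly, a rough sketch of an adaptive spatial location mechanism, under the assumption that it scores grid regions of each frame against the current decoder state and softly selects one salient position per frame; the module name and shapes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AdaptiveSpatialLocation(nn.Module):
    """Sketch: per decoding step, predict a salient spatial position in each frame."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, grid_feats, dec_state):
        # grid_feats: (batch, frames, regions, feat_dim) CNN grid features
        # dec_state:  (batch, hidden_dim) current decoder hidden state
        B, T, R, _ = grid_feats.shape
        h = dec_state[:, None, None, :].expand(B, T, R, dec_state.size(-1))
        logits = self.score(torch.cat([grid_feats, h], dim=-1)).squeeze(-1)  # (B, T, R)
        attn = logits.softmax(dim=-1)                    # location distribution per frame
        # Soft selection of one attended local feature per frame.
        return (attn.unsqueeze(-1) * grid_feats).sum(dim=2)                  # (B, T, feat_dim)
```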