Context Aware Video Caption Generation with Consecutive Differentiable Neural Computer

Recent video captioning models aim at describing all events in a long video. However, their event descriptions do not fully exploit the contextual information contained in a video because the models lack the ability to remember how information changes over time. To address this problem, we propose a novel context-aware video captioning model that generates natural language descriptions based on improved understanding of video context. We introduce an external memory, the differentiable neural computer (DNC), to improve video context understanding. The DNC naturally learns to use its internal memory for context understanding and also exposes the contents of that memory as an output for additional connections. By sequentially connecting DNC-based captioning models (DNC-augmented LSTMs) through this memory information, our consecutively connected DNC architecture can understand the context in a video without explicitly searching for event-wise correlations. Our consecutive DNC is trained sequentially, together with its language model (an LSTM), on each video clip to generate high-quality context-aware captions. In experiments, we demonstrate that our model produces more natural and coherent captions that reflect previous contextual information. Our model also shows superior quantitative performance on video captioning in terms of BLEU@4 (4.37), METEOR (9.57), and CIDEr-D (28.08).
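
To make the consecutive connection concrete, below is a minimal PyTorch sketch of the idea under simplifying assumptions: each clip's captioner is an LSTM cell augmented with a content-addressed external memory, and the memory matrix is handed to the next clip's captioner instead of being reset. All names (SimpleMemoryLSTM, caption_clips), dimensions, and the single-head read/write rule are illustrative assumptions, not the paper's exact formulation; a full DNC additionally uses temporal linkage, usage-based allocation, and multiple read heads.

import torch
import torch.nn as nn


class SimpleMemoryLSTM(nn.Module):
    """LSTM cell augmented with a content-addressed external memory.

    A heavily simplified stand-in for a DNC: one read head, one write
    head, cosine-similarity addressing, no temporal linkage or usage
    tracking.
    """

    def __init__(self, input_dim, hidden_dim, mem_slots, mem_width):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + mem_width, hidden_dim)
        self.read_key = nn.Linear(hidden_dim, mem_width)
        self.write_key = nn.Linear(hidden_dim, mem_width)
        self.write_vec = nn.Linear(hidden_dim, mem_width)
        self.mem_slots, self.mem_width = mem_slots, mem_width

    def init_memory(self, batch):
        return torch.zeros(batch, self.mem_slots, self.mem_width)

    def forward(self, x, state, memory, read_vec):
        h, c = self.cell(torch.cat([x, read_vec], dim=-1), state)
        # Content-based read: attend over slots by similarity to a key.
        r_w = torch.softmax(torch.cosine_similarity(
            memory, self.read_key(h).unsqueeze(1), dim=-1), dim=-1)
        read_vec = (r_w.unsqueeze(-1) * memory).sum(dim=1)
        # Content-based write: blend a new vector into addressed slots.
        w_w = torch.softmax(torch.cosine_similarity(
            memory, self.write_key(h).unsqueeze(1), dim=-1), dim=-1)
        memory = memory + w_w.unsqueeze(-1) * self.write_vec(h).unsqueeze(1)
        return (h, c), memory, read_vec


def caption_clips(model, embed, out_proj, clips, steps):
    """Greedy-decode one caption per clip; memory persists across clips."""
    batch = clips[0].size(0)
    memory = model.init_memory(batch)  # shared across all clips
    hid = model.cell.hidden_size
    outputs = []
    for feats in clips:  # feats: (batch, feat_dim) pooled clip features
        state = (torch.zeros(batch, hid), torch.zeros(batch, hid))
        read_vec = torch.zeros(batch, model.mem_width)
        tok = embed(torch.zeros(batch, dtype=torch.long))  # <bos> token
        logits = []
        for _ in range(steps):
            x = torch.cat([feats, tok], dim=-1)
            state, memory, read_vec = model(x, state, memory, read_vec)
            step_logits = out_proj(state[0])
            logits.append(step_logits)
            tok = embed(step_logits.argmax(dim=-1))
        outputs.append(torch.stack(logits, dim=1))
        # Memory is deliberately NOT reset here: the next clip's decoder
        # reads the context written while captioning this clip.
    return outputs


if __name__ == "__main__":
    vocab, feat_dim, emb_dim, hid = 100, 32, 16, 64
    model = SimpleMemoryLSTM(feat_dim + emb_dim, hid, mem_slots=8, mem_width=20)
    embed = nn.Embedding(vocab, emb_dim)
    out_proj = nn.Linear(hid, vocab)
    clips = [torch.randn(2, feat_dim) for _ in range(3)]  # 3 consecutive clips
    captions = caption_clips(model, embed, out_proj, clips, steps=5)
    print([c.shape for c in captions])  # 3 x torch.Size([2, 5, 100])

The key design choice this sketch illustrates is the single line that does not appear: the memory is never re-initialized between clips, so context written while captioning one event remains readable while captioning the next.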
