Boosting Video Description Generation by Explicitly Translating from Frame-Level Captions

Automatically describing video content with natural language is a fundamental challenge for computer vision. The recent advanced technique for this problem is the Recurrent Neural Network (RNN). The need to train RNNs on large-scale, complex, and diverse videos and their associated language, however, makes the task labeling-intensive and computationally expensive. Moreover, the results can suffer from robustness problems, especially when the sequence of video frames contains rich temporal dynamics. We demonstrate in this paper that these two limitations can be mitigated by jointly exploring the large amount of data available in the image domain and representing each frame by high-level attributes rather than visual features. The former leverages models learnt on image captioning benchmarks to generate a caption for each video frame, while the latter explicitly incorporates the obtained captions, which are regarded as the attributes of each frame. Specifically, we propose a novel sequence-to-sequence architecture for generating video descriptions, in the sense that its inputs are the captions of sequential frames and it outputs words sequentially. On the widely used YouTube2Text dataset, our proposal is shown to be powerful, with superior performance over several state-of-the-art methods, including both architectures developed purely on video data and RNN-based models that translate directly from visual features to language.
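
To make the caption-to-description pipeline concrete, the following is a minimal PyTorch sketch of the general idea, not the authors' exact model: each frame-level caption is reduced to a single vector (here by mean-pooling its word embeddings), an LSTM encoder summarizes the sequence of frame-caption vectors, and an LSTM decoder emits the video description word by word. The class name CaptionToDescription, the layer sizes, and the pooling choice are illustrative assumptions.

import torch
import torch.nn as nn


class CaptionToDescription(nn.Module):
    # Hypothetical sketch of a sequence-to-sequence model whose inputs are
    # frame-level captions and whose outputs are the words of a video description.
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Encoder over the sequence of per-frame caption vectors.
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Decoder that generates the video description one word at a time.
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_captions, description_in):
        # frame_captions: (batch, num_frames, caption_len) word indices
        # description_in: (batch, desc_len) right-shifted target word indices
        cap_vec = self.embed(frame_captions).mean(dim=2)  # (B, F, E) one vector per frame
        _, (h, c) = self.encoder(cap_vec)                 # summarize the frame sequence
        dec_in = self.embed(description_in)               # (B, T, E)
        dec_out, _ = self.decoder(dec_in, (h, c))         # condition decoder on encoder state
        return self.out(dec_out)                          # (B, T, vocab) word logits


# Toy usage: 2 videos, 26 sampled frames, 15-word captions per frame.
model = CaptionToDescription(vocab_size=10000)
frames = torch.randint(0, 10000, (2, 26, 15))
desc = torch.randint(0, 10000, (2, 12))
logits = model(frames, desc)  # (2, 12, 10000)

In practice the per-frame captions would be produced by an image captioning model pretrained on an image benchmark, and the decoder would be trained with cross-entropy against the ground-truth video descriptions; both of those steps are outside the scope of this sketch.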
