Automatic video description generation via LSTM with joint two-stream encoding

In this paper, we propose a novel two-stream framework based on a combination of deep neural networks. The framework consists of two components: a parallel two-stream encoding component that learns video encodings from multiple sources using 3D convolutional neural networks, and a long short-term memory (LSTM)-based decoding language model that translates the encoded video representations into text descriptions. The merits of our proposed model are: 1) it extracts both temporal and spatial features by applying 3D convolutional networks to both raw RGB frames and motion history images; 2) it can dynamically tune the weights of the different feature channels, since the network is trained end-to-end, from the combinational encoding of multiple features through to the LSTM-based language model. Our model is evaluated on three public video description datasets: one YouTube clip dataset (the Microsoft Video Description Corpus) and two large movie description datasets (the MPII Corpus and the Montreal Video Annotation Dataset), and it achieves performance comparable to or better than state-of-the-art approaches to video caption generation.
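To make the described architecture concrete, the following is a minimal sketch of a two-stream 3D-CNN encoder feeding an LSTM decoder, written in PyTorch. All layer sizes, module names, and the fusion scheme are illustrative assumptions and not the authors' exact configuration; it only shows the overall data flow (RGB stream + motion-history stream, learned fusion, LSTM language model, trained end-to-end).

```python
# Illustrative sketch only: layer sizes and fusion scheme are assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn


class Stream3DEncoder(nn.Module):
    """One 3D-CNN stream (applied to RGB clips or motion history images)."""

    def __init__(self, in_channels, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # collapse time and space
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip):            # clip: (B, C, T, H, W)
        h = self.conv(clip).flatten(1)  # (B, 64)
        return self.fc(h)               # (B, feat_dim)


class TwoStreamCaptioner(nn.Module):
    """Fuses RGB and motion-history features, then decodes a caption with an LSTM."""

    def __init__(self, vocab_size, feat_dim=256, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.rgb_stream = Stream3DEncoder(in_channels=3, feat_dim=feat_dim)
        self.mhi_stream = Stream3DEncoder(in_channels=1, feat_dim=feat_dim)
        # Learned fusion of the two feature channels; because the whole model is
        # trained end-to-end, these weights adapt alongside the language model.
        self.fusion = nn.Linear(2 * feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, rgb_clip, mhi_clip, captions):
        video = torch.cat(
            [self.rgb_stream(rgb_clip), self.mhi_stream(mhi_clip)], dim=1
        )
        h0 = torch.tanh(self.fusion(video)).unsqueeze(0)   # init LSTM hidden state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                          # (B, L, embed_dim)
        output, _ = self.lstm(emb, (h0, c0))
        return self.out(output)                             # (B, L, vocab_size) logits


# Example forward pass with random tensors.
model = TwoStreamCaptioner(vocab_size=1000)
rgb = torch.randn(2, 3, 16, 112, 112)    # 16-frame RGB clips
mhi = torch.randn(2, 1, 16, 112, 112)    # motion history images
caps = torch.randint(0, 1000, (2, 10))   # caption token ids
logits = model(rgb, mhi, caps)           # (2, 10, 1000)
```

Training such a model with a standard cross-entropy loss over the caption tokens would propagate gradients back through the fusion layer into both encoder streams, which is the end-to-end property the abstract highlights.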
