Sequence to Sequence -- Video to Text

Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should therefore be sensitive to temporal structure and should accommodate variable-length input (a sequence of frames) and output (a sequence of words). To approach this problem we propose a novel end-to-end sequence-to-sequence model that generates captions for videos. We exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip. The model naturally learns both the temporal structure of the frame sequence and a sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
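
To make the encoder-decoder idea concrete, below is a minimal PyTorch sketch of a stacked-LSTM video captioner in the spirit described above: one LSTM reads per-frame CNN features, and a second LSTM conditions word generation on the first LSTM's states. All names (S2VTSketch, frame_fc) and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    """Sketch of a two-stage stacked-LSTM captioner: encode frames, then decode words.
    Hyper-parameters (feature, embedding, and hidden sizes) are placeholders."""

    def __init__(self, feat_dim=4096, vocab_size=10000, hidden=500, embed=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.frame_fc = nn.Linear(feat_dim, embed)            # project CNN frame features
        self.lstm1 = nn.LSTM(embed, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden + embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, captions):
        # frames:   (B, T_v, feat_dim) CNN features, one vector per sampled frame
        # captions: (B, T_w) word indices, shifted right for teacher forcing
        B, T_v, _ = frames.shape
        T_w = captions.shape[1]
        pad_frames = torch.zeros(B, T_w, self.frame_fc.out_features, device=frames.device)
        pad_words = torch.zeros(B, T_v, self.embed.embedding_dim, device=frames.device)

        # Encoding stage: LSTM1 reads frame features while LSTM2 sees word padding.
        # Decoding stage: LSTM1 reads padding while LSTM2 sees word embeddings.
        h1, _ = self.lstm1(torch.cat([self.frame_fc(frames), pad_frames], dim=1))
        words = torch.cat([pad_words, self.embed(captions)], dim=1)
        h2, _ = self.lstm2(torch.cat([h1, words], dim=2))

        # Only the decoding-stage outputs are scored against the target words.
        return self.out(h2[:, T_v:])  # (B, T_w, vocab_size) logits
```

Training would minimize cross-entropy between these logits and the ground-truth caption, and inference would feed the model's own predicted words back in step by step; both details are omitted here for brevity.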
