Understanding temporal structure for video captioning

Recent research in convolutional and recurrent neural networks has fueled incredible advances in video understanding. We propose a video captioning framework that achieves the performance and quality necessary to be deployed in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus to different spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering a clever way to adaptively combine multiple frames and locations of video. As soft attention mixing weights are solved via back-propagation, the number of weights or input frames needs to be known in advance. To remove this restriction, our video understanding framework combines continuous attention mechanisms over a family of Gaussian distributions. Our efficient multistream hierarchical model combines a recurrent architecture with a soft hierarchy layer using both equally spaced and dynamically localized boundary cuts. As opposed to costly volumetric attention approaches, we use video attributes to steer temporal attention. Our fully learnable end-to-end approach helps predict salient temporal regions of action/objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT and M-VAD video datasets and compare several variants of the algorithm suitable for real-time applications. By adjusting the frame rate, we show a single computer can generate effective video captions for 100 simultaneous cameras. We additionally perform studies to show how bit rate compression modifies captioning results.

[1]  Michael S. Ryoo,et al.  Temporal attention filters for human activity recognition in videos , 2016, ArXiv.

[2]  Marcus Rohrbach,et al.  Multimodal Video Description , 2016, ACM Multimedia.

[3]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[4]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Bing Li,et al.  Salient Object Detection via Structured Matrix Decomposition. , 2017, IEEE transactions on pattern analysis and machine intelligence.

[6]  Xin Wang,et al.  Video Captioning via Hierarchical Reinforcement Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[8]  Christopher Joseph Pal,et al.  Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.

[9]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[10]  Kunio Fukunaga,et al.  Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.

[11]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Rita Cucchiara,et al.  Hierarchical Boundary-Aware Neural Encoder for Video Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Zhe Gan,et al.  Semantic Compositional Networks for Visual Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[15]  William B. Dolan,et al.  Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[16]  Ramesh Raskar,et al.  Object classification through scattering media with deep learning on time resolved measurement. , 2017, Optics express.

[17]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Jürgen Schmidhuber,et al.  Recurrent Highway Networks , 2016, ICML.

[21]  Dennis Koelma,et al.  The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection , 2016, ICMR.

[22]  Jorma Laaksonen,et al.  Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation , 2016, ACM Multimedia.

[23]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jongwook Choi,et al.  End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[26]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[27]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[28]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[30]  Zhe Gan,et al.  Adaptive Feature Abstraction for Translating Video to Text , 2018, AAAI.

[31]  Kate Saenko,et al.  Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild , 2014, COLING.

[32]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[33]  Trevor Darrell,et al.  Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[35]  Cordelia Schmid,et al.  Areas of Attention for Image Captioning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[37]  Nathan Schneider,et al.  Association for Computational Linguistics: Human Language Technologies , 2011 .

[38]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[39]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[40]  Xirong Li,et al.  Word2VisualVec: Cross-Media Retrieval by Visual Feature Prediction , 2016, ArXiv.

[41]  Guosheng Lin,et al.  Exploring Context with Deep Structured Models for Semantic Segmentation , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Yann LeCun,et al.  Predicting Deeper into the Future of Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Sven J. Dickinson,et al.  Video In Sentences Out , 2012, UAI.

[45]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Chenxi Liu,et al.  Attention Correctness in Neural Image Captioning , 2016, AAAI.

[47]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Rong Xie,et al.  Shot boundary detection using convolutional neural networks , 2016, 2016 Visual Communications and Image Processing (VCIP).

[49]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[50]  Wei Xu,et al.  Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Yi Yang,et al.  Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[53]  Shuicheng Yan,et al.  A survey on deep learning-based fine-grained object classification and semantic segmentation , 2017, International Journal of Automation and Computing.

[54]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Tao Mei,et al.  Jointly Localizing and Describing Events for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[56]  Bing Li,et al.  Salient Object Detection via Structured Matrix Decomposition. , 2016, IEEE transactions on pattern analysis and machine intelligence.

[57]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[58]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[59]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[60]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Xirong Li,et al.  Early Embedding and Late Reranking for Video Captioning , 2016, ACM Multimedia.

[62]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[63]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Guosheng Lin,et al.  Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).