Dual-Stream Recurrent Neural Network for Video Captioning

Recent progress in using recurrent neural networks (RNNs) for video description has attracted increasing interest, owing to their capability to encode a sequence of frames for caption generation. While existing methods have studied various features (e.g., CNN, 3D CNN, and semantic attributes) for visual encoding, the representation and fusion of heterogeneous information from multi-modal spaces have not been fully explored. Because different modalities are often asynchronous, frame-level multi-modal fusion (e.g., concatenation and linear fusion) can negatively affect each modality. In this paper, we propose a dual-stream RNN (DS-RNN) framework that jointly discovers and integrates the hidden states of visual and semantic streams for video caption generation. First, an encoding RNN is used for each stream to flexibly exploit the hidden states of the respective modality. Specifically, we propose an attentive multi-grained encoder module that enhances local feature learning with global semantic features. Then, a dual-stream decoder integrates the asynchronous yet complementary sequential hidden states from both streams for caption generation. Extensive experiments on three benchmark datasets, namely MSVD, MSR-VTT, and MPII-MD, show that DS-RNN achieves competitive performance against the state of the art. Additional ablation studies evaluate several variants of the proposed DS-RNN.

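To make the dual-stream idea concrete, below is a minimal PyTorch-style sketch. Everything here is an illustrative assumption reconstructed from the abstract, not the authors' released implementation: the class name DualStreamCaptioner, the use of LSTMs for both stream encoders, the additive attention form, and the concatenation of the two attended contexts at each decoding step are all hypothetical stand-ins for the actual DS-RNN design.

```python
# A minimal sketch of a dual-stream encoder-decoder captioner, assuming
# PyTorch. Hypothetical dimensions and attention form; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamCaptioner(nn.Module):
    def __init__(self, vis_dim, sem_dim, hid_dim, vocab_size):
        super().__init__()
        # One encoding RNN per modality, so each stream keeps its own
        # (possibly asynchronous) sequence of hidden states.
        self.vis_enc = nn.LSTM(vis_dim, hid_dim, batch_first=True)
        self.sem_enc = nn.LSTM(sem_dim, hid_dim, batch_first=True)
        # Separate attention heads let the decoder attend to each stream
        # independently instead of fusing the modalities frame by frame.
        self.vis_att = nn.Linear(2 * hid_dim, 1)
        self.sem_att = nn.Linear(2 * hid_dim, 1)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.dec = nn.LSTMCell(3 * hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def _attend(self, states, h, layer):
        # Additive-style attention over one stream's hidden states.
        q = h.unsqueeze(1).expand(-1, states.size(1), -1)
        w = F.softmax(layer(torch.cat([states, q], -1)).squeeze(-1), dim=1)
        return (w.unsqueeze(-1) * states).sum(1)

    def forward(self, vis_feats, sem_feats, captions):
        vis_h, _ = self.vis_enc(vis_feats)   # (B, Tv, H)
        sem_h, _ = self.sem_enc(sem_feats)   # (B, Ts, H)
        B = vis_feats.size(0)
        h = vis_feats.new_zeros(B, self.dec.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # Per-step attention over both streams; the contexts are fused
            # only at decoding time, after each stream has been encoded.
            ctx_v = self._attend(vis_h, h, self.vis_att)
            ctx_s = self._attend(sem_h, h, self.sem_att)
            x = torch.cat([self.embed(captions[:, t]), ctx_v, ctx_s], -1)
            h, c = self.dec(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)    # (B, T, vocab_size)
```

Note that the two streams are never fused at the frame level in this sketch: each modality keeps its own encoder and attention head, and only the attended context vectors are concatenated at every decoding step, mirroring the asynchronous-fusion argument above.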