Attention-Based Bidirectional Recurrent Neural Networks for Description Generation of Videos

Describing videos in natural language is important in many applications, such as managing large volumes of online video and providing descriptive video service (DVS) for blind people. To improve on existing video description frameworks, this paper presents an end-to-end deep learning model that combines Convolutional Neural Networks (CNNs) and Bidirectional Recurrent Neural Networks (BiRNNs) with a multimodal attention mechanism. First, the model produces richer video representations than comparable approaches, covering image, motion, and audio features. Second, a BiRNN encodes these features in both the forward and backward directions. Finally, an attention-based decoder translates the sequential outputs of the encoder into a sequence of words. The model is evaluated on the Microsoft Research Video Description Corpus (MSVD) dataset. The results demonstrate the necessity of combining BiRNNs with a multimodal attention mechanism and show that the model outperforms other state-of-the-art methods evaluated on this dataset.
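The encoder-decoder interaction described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the attention scoring function, so dot-product scoring is assumed here, and the names `concat_bidirectional` and `attend`, along with the toy state values, are hypothetical. The sketch concatenates forward and backward encoder states per time step and computes a soft-attention context vector for one decoder step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_bidirectional(forward_states, backward_states):
    """Join forward and backward encoder states at each time step.

    Both lists are assumed to be aligned in time order already.
    """
    return [f + b for f, b in zip(forward_states, backward_states)]


def attend(query, encoder_states):
    """Soft attention with dot-product scoring (an assumption, see lead-in).

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three time steps, 2-dimensional state per direction.
forward = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
backward = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]
states = concat_bidirectional(forward, backward)

# Hypothetical decoder query with the same width as a concatenated state.
weights, context = attend([1.0, 0.0, 0.0, 1.0], states)
```

In a multimodal setting, a step like `attend` could be applied separately to the image, motion, and audio encoder outputs before the resulting context vectors are fused; how the paper fuses them is not stated in the abstract.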
