Attention-based Visual-Audio Fusion for Video Caption Generation

Most recent work on generating a textual description of a video follows an encoder-decoder framework. In the encoder stage, different convolutional neural networks extract features from the audio and visual modalities respectively; the extracted features are then fed into the decoder stage, where an LSTM generates the video caption. Current work centers on two questions: whether captions become more accurate when different multimodal fusion strategies are adopted, and whether they become more accurate when an attention mechanism is added. In this paper, we propose a fusion framework that combines these two lines of work into a new model. In the encoder stage, two multimodal fusion strategies, weight sharing and memory sharing, are applied so that the two kinds of features interact to produce the final feature outputs. In the decoder stage, an LSTM with an attention mechanism generates the video description. Our fusion model, combining both methods, is validated on the Microsoft Research Video to Text (MSR-VTT) dataset.
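The abstract does not give equations for the attention step, so as a rough illustration only: one common way to fuse audio and visual segment features in the decoder is soft dot-product attention conditioned on the decoder hidden state. The shapes, feature dimension, and function names below are assumptions for the sketch, not the paper's actual model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(visual_feats, audio_feats, query):
    """Hypothetical soft-attention fusion sketch.

    visual_feats: (Tv, d) per-segment visual features (e.g. from a CNN)
    audio_feats:  (Ta, d) per-segment audio features (e.g. from MFCCs)
    query:        (d,) decoder LSTM hidden state at the current step
    """
    # Pool both modalities into one set of candidate features.
    feats = np.concatenate([visual_feats, audio_feats], axis=0)  # (Tv+Ta, d)
    scores = feats @ query            # dot-product attention scores, (Tv+Ta,)
    weights = softmax(scores)         # attention distribution over segments
    context = weights @ feats         # fused context vector, (d,)
    return context, weights

rng = np.random.default_rng(0)
v = rng.standard_normal((8, 16))      # 8 visual segments, 16-dim features
a = rng.standard_normal((4, 16))      # 4 audio segments, same dimension
q = rng.standard_normal(16)           # decoder hidden state
ctx, w = attention_fusion(v, a, q)
```

The context vector `ctx` would then be concatenated with the decoder input at each step, letting the LSTM attend to whichever visual or audio segments are most relevant to the word being generated.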
