Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description

Current video description approaches are based on encoder-decoder sentence generation with recurrent neural networks (RNNs) and integrate a temporal attention mechanism, in which the decoder network predicts each word of the description by selectively giving more weight to encoded features from specific time frames. We incorporate audio features, in addition to image and motion features, for video description based on encoder-decoder RNNs. To fuse these modalities, we introduce a multimodal attention model that can selectively attend to features from different modalities for each word of the output description. We apply this framework to video description using MFCC features as well as state-of-the-art audio features such as SoundNet. Results confirm that our attention-based multimodal fusion of audio and visual features outperforms conventional video description approaches.
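The abstract describes two levels of attention: temporal attention within each modality, and a modality-level attention that weights the resulting per-modality context vectors at every decoding step. The sketch below illustrates that idea in NumPy; all function and parameter names (temporal_attention, multimodal_attention_fusion, W_s, V_c, and so on) are illustrative assumptions for exposition, not the authors' implementation or exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_attention(decoder_state, features, W_s, W_f, w):
    """Score each time frame of one modality against the decoder state
    and return an attention-weighted context vector."""
    # features: (T, d_feat); decoder_state: (d_state,)
    scores = np.array([w @ np.tanh(W_s @ decoder_state + W_f @ f) for f in features])
    alphas = softmax(scores)           # temporal attention weights over frames
    return alphas @ features           # context vector, shape (d_feat,)

def multimodal_attention_fusion(decoder_state, modality_features, params):
    """Fuse per-modality context vectors using attention weights
    computed from the current decoder state."""
    contexts, scores = [], []
    for name, feats in modality_features.items():
        p = params[name]
        c = temporal_attention(decoder_state, feats, p["W_s"], p["W_f"], p["w"])
        # project each modality's context into a shared space before fusion
        contexts.append(np.tanh(p["W_c"] @ c + p["b_c"]))
        # modality-level attention score for this decoding step
        scores.append(p["v"] @ np.tanh(p["V_s"] @ decoder_state + p["V_c"] @ c))
    betas = softmax(np.array(scores))  # attention weights over modalities
    fused = sum(b * c for b, c in zip(betas, contexts))
    return fused, betas

# Toy usage: three modalities (image, motion, audio) with random features.
rng = np.random.default_rng(0)
d_state, d_att, d_shared = 8, 6, 10

def make_params(d_feat):
    return {"W_s": rng.normal(size=(d_att, d_state)),
            "W_f": rng.normal(size=(d_att, d_feat)),
            "w":   rng.normal(size=d_att),
            "W_c": rng.normal(size=(d_shared, d_feat)),
            "b_c": rng.normal(size=d_shared),
            "V_s": rng.normal(size=(d_att, d_state)),
            "V_c": rng.normal(size=(d_att, d_feat)),
            "v":   rng.normal(size=d_att)}

feats = {"image": rng.normal(size=(5, 12)),
         "motion": rng.normal(size=(5, 16)),
         "audio": rng.normal(size=(5, 20))}
params = {k: make_params(v.shape[1]) for k, v in feats.items()}
fused, betas = multimodal_attention_fusion(rng.normal(size=d_state), feats, params)
print(betas)  # per-modality attention weights for this decoding step
```

In this sketch the modality weights (betas) are recomputed from the decoder state at every output word, which is what lets the model emphasize, say, audio features when generating sound-related words and visual features elsewhere.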
