A Multimodal Aggregation Network With Serial Self-Attention Mechanism for Micro-Video Multi-Label Classification

Micro-videos have attracted increasing attention due to their unique properties and great commercial value. Because micro-videos naturally carry multimodal information, real applications need a method that learns discriminative joint multimodal representations. Inspired by the success of attention-based neural architectures across various tasks, we propose a multimodal aggregation network (MANET) with a serial self-attention mechanism for micro-video multi-label classification. Specifically, we first propose a parallel content-dependent graph neural network (CDGNN) module, which explores category-related embeddings of micro-videos by disentangling category relations into modality-specific and modality-shared category dependency patterns. We then introduce a serial self-attention (SSA) module that passes multimodal information along in sequential order and incorporates an aggregation bottleneck to collect and condense the most significant information. Experiments on a large-scale multi-label micro-video dataset demonstrate that the proposed method achieves competitive results compared with several state-of-the-art methods.
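The abstract does not give the SSA equations, so the following is only a minimal sketch of one way serial aggregation through a bottleneck could work, not the authors' implementation. It assumes single-head scaled dot-product self-attention with no learned projections, a small set of bottleneck tokens that is carried from one modality to the next, and three hypothetical modalities (`visual`, `acoustic`, `textual`). All function and variable names are illustrative.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention (assumption: no learned
    query/key/value projections, so each output is a convex combination of
    the input tokens)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(dim)])
    return out


def serial_aggregate(modalities, bottleneck):
    """Visit the modalities in order. At each step, self-attention runs over
    the bottleneck tokens joined with the current modality's tokens, and only
    the updated bottleneck tokens are passed on to the next step."""
    n = len(bottleneck)
    for tokens in modalities:
        joint = self_attention(bottleneck + tokens)
        bottleneck = joint[:n]  # condense: keep only the bottleneck slots
    return bottleneck


random.seed(0)
dim = 4


def make_tokens(count):
    # Random stand-ins for per-modality features (hypothetical data).
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


visual, acoustic, textual = make_tokens(5), make_tokens(3), make_tokens(2)
init_bottleneck = [[0.0] * dim for _ in range(2)]
fused = serial_aggregate([visual, acoustic, textual], init_bottleneck)
```

Here `fused` holds two condensed tokens regardless of how many tokens each modality contributes, which is the point of a bottleneck: later modalities see earlier ones only through these few slots. A real model would add learned projections, multiple heads, and a multi-label output layer.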
