Paper Information - A Multimodal Aggregation Network With Serial Self-Attention Mechanism for Micro-Video Multi-Label Classification


Micro-videos have attracted increasing attention due to their unique properties and great commercial value. Because micro-videos naturally carry multimodal information, a powerful method for learning distinct joint multimodal representations is essential for real applications. Inspired by the potential of attention-based neural architectures across various tasks, we propose a multimodal aggregation network (MANET) with a serial self-attention mechanism for micro-video multi-label classification. Specifically, we first propose a parallel content-dependent graph neural network (CDGNN) module, which explores category-related embeddings of micro-videos by disentangling category relations into modality-specific and modality-shared category dependency patterns. We then introduce a serial self-attention (SSA) module that transmits the multimodal information in sequential order and incorporates an aggregation bottleneck to better collect and condense the significant information. Experiments on a large-scale multi-label micro-video dataset demonstrate that the proposed method achieves competitive results compared with several state-of-the-art methods.
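The idea of passing multimodal information through self-attention in sequential order, with a small bottleneck that condenses it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer details, so the single-head attention, the modality order, the token counts, the random weights, and the names `serial_self_attention`, `bottleneck`, and `params` are all assumptions made for illustration. The CDGNN module is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over a token matrix.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def serial_self_attention(modalities, bottleneck, params):
    # Hypothetical serial scheme: a few bottleneck tokens visit the
    # modalities one after another. At each stage they attend jointly
    # with that modality's tokens, and only the updated bottleneck
    # tokens are carried forward to the next modality.
    for tokens, (w_q, w_k, w_v) in zip(modalities, params):
        joint = np.concatenate([bottleneck, tokens], axis=0)
        attended = self_attention(joint, w_q, w_k, w_v)
        bottleneck = attended[: bottleneck.shape[0]]
    return bottleneck

rng = np.random.default_rng(0)
dim, n_bottleneck, n_labels = 16, 4, 5  # illustrative sizes only

# Three stand-in modalities with different token counts (assumed).
modalities = [rng.normal(size=(n, dim)) for n in (10, 8, 6)]
params = [
    tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    for _ in modalities
]
bottleneck = rng.normal(size=(n_bottleneck, dim))

fused = serial_self_attention(modalities, bottleneck, params)

# Multi-label head: one independent sigmoid score per label.
w_out = rng.normal(scale=dim ** -0.5, size=(dim, n_labels))
probs = 1.0 / (1.0 + np.exp(-(fused.mean(axis=0) @ w_out)))
```

Because only the bottleneck tokens cross from one modality to the next, whatever later stages learn about earlier modalities has to pass through those few tokens, which is one plausible reading of "collect and condense" in the abstract.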

Yuting Su | Peiguang Jing | Wei Lu | Jiaxin Lin
