Joint Learning of NNeXtVLAD, CNN and Context Gating for Micro-Video Venue Classification

Micro-videos have grown explosively on online social platforms, which makes encoding them into effective representations an important problem. NeXtVLAD is an effective network that aggregates frame-level features into a compact supervector. However, the discriminative capability of this supervector remains limited because the original NeXtVLAD network lacks a non-linear transformation at its head and L2 normalization at its tail. To address these problems, we propose an improved neural network architecture, normalized NeXtVLAD (NNeXtVLAD), which extends NeXtVLAD with a ReLU function and L2 normalization. Building on this network, we construct an end-to-end framework that jointly learns NNeXtVLAD, a CNN layer, and context gating for micro-video venue classification. Specifically, we first apply NNeXtVLAD layers in a three-stream architecture to aggregate visual, acoustic, and textual features. We then pack the aggregated features and embed them into a CNN layer to enhance the sparse concept-level representation. Finally, context gating captures the interdependency among different network activations. Extensive experiments on a real-world micro-video dataset show that the proposed model significantly outperforms state-of-the-art baselines in terms of both Micro-F1 and Macro-F1 scores.
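
The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the NeXtVLAD aggregation itself is replaced by plain mean pooling as a stand-in, the CNN layer is omitted, and all names, dimensions, and random weights (`aggregate_stream`, `context_gating`, `W`, `b`) are hypothetical. It only shows where the ReLU (head) and L2 normalization (tail) sit around a per-stream aggregator, followed by a gating step of the common form sigmoid(Wx + b) ⊙ x.

```python
import numpy as np

def relu(x):
    # Element-wise non-linearity applied at the head of each stream.
    return np.maximum(x, 0.0)

def l2_normalize(x, eps=1e-12):
    # Scale the vector to unit L2 norm (applied at the tail of each stream).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def aggregate_stream(frames):
    # frames: (num_frames, dim) frame-level features for one modality.
    activated = relu(frames)            # head: non-linear transformation
    pooled = activated.mean(axis=0)     # ASSUMPTION: mean pooling stands in for NeXtVLAD
    return l2_normalize(pooled)         # tail: L2 normalization

def context_gating(x, W, b):
    # Gate each activation by a sigmoid of a linear map of all activations.
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return gate * x

rng = np.random.default_rng(0)
# Hypothetical frame-level features for the three modalities.
visual = rng.normal(size=(8, 16))
acoustic = rng.normal(size=(8, 16))
textual = rng.normal(size=(1, 16))

streams = [aggregate_stream(s) for s in (visual, acoustic, textual)]
fused = np.concatenate(streams)         # packed three-stream representation

# Hypothetical, untrained gating parameters.
W = rng.normal(scale=0.1, size=(fused.size, fused.size))
b = np.zeros(fused.size)
gated = context_gating(fused, W, b)
```

Because the gate values lie strictly between 0 and 1, the gating step can only attenuate each activation, which is how it re-weights features according to their interdependencies.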
