Temporal Attention Mechanism with Conditional Inference for Large-Scale Multi-label Video Classification

We present neural-network-based methods that effectively combine multimodal sequential inputs and classify them into multiple categories. The two key ideas are (1) selecting informative frames from a sequence using an attention mechanism and (2) exploiting correlations between labels to solve multi-label classification problems. The attention mechanism is applied along both the modality (spatial) and sequential (temporal) dimensions to ignore noisy and meaningless frames. Furthermore, to address the fundamental problems caused by predicting each label independently, as conventional multi-label classification methods do, the proposed method models the dependencies among labels by decomposing the joint probability of the labels into conditional terms. Based on the experimental results (5th place in the Kaggle competition), we discuss how the proposed methods operate in the YouTube-8M Classification Task, what insights they offer, and why they succeed or fail.
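
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the architecture used in this work, which the abstract does not specify: the dot-product scoring, the `query` vector, and the helper names `attention_pool`, `chain_joint_probability`, and `toy_conditional` are all assumptions made for illustration. The first function pools frame features with softmax attention weights so that low-scoring frames contribute little; the second evaluates a label vector under the chain-rule factorization p(y_1, ..., y_K | x) = prod_k p(y_k | y_1, ..., y_{k-1}, x).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Temporal attention pooling (illustrative assumption, not the paper's model).

    Each frame is scored by its dot product with `query`; the pooled vector
    is the attention-weighted average of the frames.
    """
    scores = [sum(f_d * q_d for f_d, q_d in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def chain_joint_probability(labels, conditional):
    """Joint probability of a binary label vector via the chain rule.

    `conditional(k, previous)` must return p(y_k = 1 | y_<k, x); here it is a
    hypothetical callable standing in for a learned model.
    """
    prob = 1.0
    for k, y in enumerate(labels):
        p_on = conditional(k, labels[:k])
        prob *= p_on if y == 1 else 1.0 - p_on
    return prob


def toy_conditional(k, previous):
    """Toy dependency: a label is likelier when the preceding label is on."""
    if previous and previous[-1] == 1:
        return 0.9
    return 0.2


# Three 2-d "frames"; the query favors the first feature dimension.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
pooled, weights = attention_pool(frames, query=[2.0, 0.0])

# Probability of the label set (1, 1, 0) under the toy chain.
joint = chain_joint_probability([1, 1, 0], toy_conditional)
```

Because each conditional term is a valid Bernoulli probability, the joint probabilities of all 2^K label vectors sum to one, which independent per-label thresholds do not guarantee to respect when labels are correlated.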
