The YouTube-8M Kaggle Competition: Challenges and Methods

We took part in the YouTube-8M Video Understanding Challenge hosted on Kaggle, and achieved the 10th place within less than one month's time. In this paper, we present an extensive analysis and solution to the underlying machine-learning problem based on frame-level data, where major challenges are identified and corresponding preliminary methods are proposed. It's noteworthy that, with merely the proposed strategies and uniformly-averaging multi-crop ensemble was it sufficient for us to reach our ranking. We also report the methods we believe to be promising but didn't have enough time to train to convergence. We hope this paper could serve, to some extent, as a review and guideline of the YouTube-8M multi-label video classification benchmark, inspiring future attempts and research.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[3]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[6]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[7]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[10]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[12]  Ji Wu,et al.  The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge , 2017, ArXiv.

[13]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[14]  Miha Skalic,et al.  Deep Learning Methods for Efficient Large Scale Video Labeling , 2017, ArXiv.

[15]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[16]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[17]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[18]  Hermann Ney,et al.  A comprehensive study of deep bidirectional LSTM RNNS for acoustic modeling in speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Geoff Holmes,et al.  Classifier chains for multi-label classification , 2009, Machine Learning.

[20]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[22]  Raymond J. Mooney,et al.  Experiments on Ensembles with Missing and Noisy Data , 2004, Multiple Classifier Systems.