Kobe University, NICT and University of Siegen at TRECVID 2017 AVS Task

This paper presents the results of the kobe_nict_siegen team, a collaboration of three research institutes (Kobe University, NICT, and the University of Siegen), at the TRECVID 2017 Ad-hoc Video Search (AVS) task. We submitted the following three runs. 1) kobe_nict_siegen_D_M_1: This run uses feature vectors extracted by a pre-trained convolutional neural network (CNN) as input to a small-scale multi-layer neural network called a micro neural network (microNN). The microNN is a lightweight detector fine-tuned to a target concept using a balanced set of positive and negative examples drawn from the ImageNet and IACC video datasets. Finally, the outputs of several microNNs are combined to generate the ranked result. 2) kobe_nict_siegen_D_M_2: This run is basically identical to kobe_nict_siegen_D_M_1 but additionally applies motion features, which are extracted by a motion CNN that involves biologically inspired motion thresholding and competitive learning. 3) kobe_nict_siegen_D_M_3: This run is a degraded version of kobe_nict_siegen_D_M_1 that applies average pooling over the entire feature map extracted by the CNN, reducing the dimension of the feature vector from 3×3×2048 to 2048. We also use a word2vec model to find synonyms of concepts in order to improve performance.
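The pooling-plus-detector pipeline described above can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the 3×3×2048 feature map is averaged over its spatial grid to a 2048-dim vector, which a small two-layer network then scores for one concept. The hidden-layer size, weight initialization, and activation choices here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def average_pool(feature_map):
    """Collapse the 3x3 spatial grid to a single 2048-dim channel vector."""
    return feature_map.mean(axis=(0, 1))

def micro_nn_score(x, w1, b1, w2, b2):
    """Tiny two-layer detector: ReLU hidden layer, sigmoid output score.

    The real microNN is fine-tuned per concept; sizes here are illustrative.
    """
    h = np.maximum(0.0, x @ w1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))

# One shot's CNN feature map (3x3 spatial grid, 2048 channels).
feature_map = rng.standard_normal((3, 3, 2048))
x = average_pool(feature_map)  # shape: (2048,)

# Randomly initialised micro network (hypothetical hidden size of 64).
w1, b1 = rng.standard_normal((2048, 64)) * 0.01, np.zeros(64)
w2, b2 = rng.standard_normal(64) * 0.01, 0.0

score = micro_nn_score(x, w1, b1, w2, b2)  # concept relevance in (0, 1)
```

Ranking a query's result list then amounts to sorting shots by such scores, with scores from several concept-specific microNNs combined beforehand.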
