TRECVid 2013 Semantic Video Concept Detection by NTT-MD-DUT

In this report, we describe the approaches and experiments for the TRECVid 2013 video concept detection task conducted by NTT Media Intelligence Laboratories in collaboration with Dalian University of Technology. This year, we focused our efforts on two aspects. First, we investigated state-of-the-art machine learning algorithms and feature representations for constructing large-scale concept classifiers. Specifically, we evaluated the Fisher Vector, a recently developed and powerful image representation that has been successfully adopted in other visual classification tasks, for concept detection. We were also interested in using deep learning techniques for video classification; to this end, we tested various deep learning settings and report the results. Second, we followed the subspace-partition-based framework proposed in our work last year and, to balance precision and efficiency, proposed a sparse soft-clustering method for ensemble learning that can obtain the optimal replication parameter. We conducted experiments on the TRECVid SIN task evaluation dataset and submitted four runs based on the above methods.
