Paper information - TokyoTechCanon at TRECVID 2013

TokyoTechCanon at TRECVID 2013

We aim to develop a high-performance system for the semantic indexing task [1, 2, 3, 4] using Gaussian-mixture-model (GMM) supervectors and tree-structured GMMs [6, 7, 8]. GMM supervectors corresponding to six types of audio and visual features are extracted from video shots. Tree-structured GMMs reduce the computational cost of the maximum a posteriori (MAP) adaptation used to estimate GMM parameters while maintaining high accuracy. This year, we improve our re-scoring method, which uses video-clip scores, by introducing a scaling parameter. Here, the video-clip score is defined as the maximum shot score among all the shots in a video clip. Our best result was 28.4% in terms of Mean InfAP, which ranked third among the participating teams in the semantic indexing task.
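The re-scoring idea can be illustrated with a short sketch. Only the definition of the video-clip score (the maximum shot score within a clip) comes from the abstract. The combination rule below, which multiplies each shot score by its clip score raised to a scaling parameter `alpha`, is an assumption made for illustration, as are the function names and the premise that scores lie in [0, 1]. The abstract does not state the actual formula.

```python
def clip_scores(shot_scores, shot_to_clip):
    """Video-clip score: the maximum shot score among all shots in the clip."""
    best = {}
    for shot, score in shot_scores.items():
        clip = shot_to_clip[shot]
        if clip not in best or score > best[clip]:
            best[clip] = score
    return best


def rescore(shot_scores, shot_to_clip, alpha=0.5):
    """Re-score each shot using the score of the clip that contains it.

    ASSUMPTION: the rule shot_score * clip_score ** alpha is hypothetical;
    the source only says that a scaling parameter is introduced.
    With alpha = 0 the original shot scores are returned unchanged.
    """
    clips = clip_scores(shot_scores, shot_to_clip)
    return {
        shot: score * clips[shot_to_clip[shot]] ** alpha
        for shot, score in shot_scores.items()
    }


if __name__ == "__main__":
    # Toy data: clip A contains one confident shot, clip B does not.
    scores = {"s1": 0.9, "s2": 0.2, "s3": 0.4}
    mapping = {"s1": "A", "s2": "A", "s3": "B"}
    for shot, value in sorted(rescore(scores, mapping, alpha=1.0).items()):
        print(shot, round(value, 3))
```

Under this assumed rule, a larger `alpha` lets the clip-level evidence reshape the shot ranking more strongly, while `alpha = 0` disables re-scoring entirely.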

[1] Georges Quénot,et al. TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[2] Matthijs C. Dorst. Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[3] Paul Over,et al. Creating HAVIC: Heterogeneous Audio Visual Internet Collection , 2012, LREC.

[4] Cordelia Schmid,et al. Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[5] Ivan Laptev,et al. On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[6] Cordelia Schmid,et al. A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[7] Paul Over,et al. High-level feature detection from video in TRECVid: a 5-year retrospective of achievements , 2009 .

[8] Steve Young,et al. The HTK book , 1995 .

[9] Shuicheng Yan,et al. An HOG-LBP human detector with partial occlusion handling , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[10] Cordelia Schmid,et al. Coloring Local Feature Extraction , 2006, ECCV.

[11] Paul Over,et al. Evaluation campaigns and TRECVid , 2006, MIR '06.

[12] Koichi Shinoda,et al. Event detection in consumer videos using GMM supervectors and SVMs , 2013, EURASIP J. Image Video Process..

[13] Michael Isard,et al. Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[14] Koichi Shinoda,et al. A fast MAP adaptation technique for gmm-supervector-based video semantic indexing systems , 2011, ACM Multimedia.

[15] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[16] Andrew Zisserman,et al. Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[17] Pavel Zemcík,et al. Annotating Images with Suggestions - User Study of a Tagging System , 2012, ACIVS.

[18] Georges Quénot,et al. Re-ranking by local re-scoring for video indexing and retrieval , 2011, CIKM '11.

[19] Steve Young,et al. The HTK book version 3.4 , 2006 .

[20] Michael Isard,et al. ICONDENSATION: Unifying Low-Level and High-Level Tracking in a Stochastic Framework , 1998, ECCV.

[21] Cordelia Schmid,et al. Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[22] Cordelia Schmid,et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[23] Berthold K. P. Horn,et al. Determining Optical Flow , 1981, Other Conferences.

[24] Stéphane Ayache,et al. Video Corpus Annotation Using Active Learning , 2008, ECIR.

[25] O. Chum,et al. Enhancing RANSAC by Generalized Model Optimization , 2003 .

[26] Koichi Shinoda,et al. TokyoTech+Canon at TRECVID 2011 , 2011, TRECVID.

[27] Andrea Vedaldi,et al. Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[28] Paul A. Viola,et al. Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[29] A. Smeaton,et al. TRECVID 2013 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics , 2011 .

[30] D.M. Mount,et al. An Efficient k-Means Clustering Algorithm: Analysis and Implementation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[31] Cordelia Schmid,et al. Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[32] Matti Pietikäinen,et al. A comparative study of texture measures with classification based on featured distributions , 1996, Pattern Recognit..

[33] Koichi Shinoda,et al. High-Level Feature Extraction Using SIFT GMMs and Audio Models , 2010, 2010 20th International Conference on Pattern Recognition.

[34] Koichi Shinoda,et al. A Fast and Accurate Video Semantic-Indexing System Using Fast MAP Adaptation and GMM Supervectors , 2012, IEEE Transactions on Multimedia.