Event detection in consumer videos using GMM supervectors and SVMs

In large-scale multimedia event detection, complex target events must be detected in a large set of consumer-generated web videos taken in unconstrained environments. We devised a multimedia event detection method based on Gaussian mixture model (GMM) supervectors and support vector machines (SVMs). A GMM supervector consists of the parameters of a GMM that models the distribution of low-level features extracted from a video clip. This representation can be regarded as an extension of the bag-of-words framework to a probabilistic framework, and it is therefore expected to be robust against the data insufficiency problem. We also propose a camera-motion-cancelled feature, a spatio-temporal feature that is robust against the camera motion found in consumer-generated web videos. By combining these methods with existing features, we aim to construct a high-performance event detection system. The effectiveness of our method is evaluated on the TRECVID MED task benchmark.
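To make the supervector idea concrete, the following is a minimal sketch of one common way to build a GMM supervector for a clip. The abstract does not give the construction, so everything here is an assumption for illustration: a shared background GMM with diagonal covariances, MAP adaptation of the means only, a relevance factor `tau`, and a supervector formed by stacking the adapted means scaled by the square root of the mixture weight over the standard deviation. The function names and the toy data are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(features, weights, means, variances, tau=20.0):
    """MAP-adapt the background means to one clip's features (assumed scheme)."""
    num_comp, dim = len(means), len(means[0])
    counts = [0.0] * num_comp
    sums = [[0.0] * dim for _ in range(num_comp)]
    for x in features:
        post = responsibilities(x, weights, means, variances)
        for k in range(num_comp):
            counts[k] += post[k]
            for d in range(dim):
                sums[k][d] += post[k] * x[d]
    # Interpolate between the background mean and the clip statistics.
    return [
        [(tau * means[k][d] + sums[k][d]) / (tau + counts[k]) for d in range(dim)]
        for k in range(num_comp)
    ]


def gmm_supervector(features, weights, means, variances, tau=20.0):
    """Stack the adapted means, scaled by sqrt(weight) / standard deviation."""
    adapted = map_adapt_means(features, weights, means, variances, tau)
    supervector = []
    for w, mu, var in zip(weights, adapted, variances):
        supervector.extend(math.sqrt(w) * m / math.sqrt(v) for m, v in zip(mu, var))
    return supervector


# Toy background model: 2 components in a 2-dimensional feature space.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [4.0, 4.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]

# Toy low-level features for one clip.
clip_features = [[0.2, -0.1], [0.1, 0.3], [3.8, 4.1], [4.2, 3.9]]
supervector = gmm_supervector(clip_features, ubm_weights, ubm_means, ubm_vars)
```

Each clip yields a vector of fixed length (number of components times feature dimension) regardless of how many features the clip contains, which is what allows an SVM to be trained on one vector per clip.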

Koichi Shinoda | Nakamasa Inoue | Yusuke Kamishima
