Paper information - YouTubeEvent: On large-scale video event classification

In this work, we investigate the problem of general event classification from uncontrolled YouTube videos. The task is challenging because of the number of possible categories and the large intra-class variations. On one hand, it must be decided how to define proper event category labels and how to obtain training samples for these categories; on the other hand, achieving satisfactory classification performance is non-trivial. To address these problems, a text mining pipeline is first proposed to automatically discover a collection of video event categories. Part-of-Speech (POS) analysis is applied to YouTube video titles and descriptions, and the WordNet hierarchy is employed to refine the category selection. This results in 29,163 video event categories. A POS-based query method is then applied to video titles, and 6,538,319 video samples are obtained from YouTube to represent these categories. To improve classification performance, video content-based features are complemented with scores from a set of classifiers, which can be regarded as a type of high-level feature. Extensive evaluations demonstrate the effectiveness of the proposed automatic event label mining technique, and our feature fusion scheme shows encouraging classification results.
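The feature fusion idea in the abstract, complementing content-based features with scores from a set of classifiers, can be illustrated with a minimal sketch. The abstract does not say how the fusion is performed, so simple concatenation is assumed here. The names `fuse_features` and `linear_scorer`, the toy linear classifiers, and the 3-dimensional feature vector are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Scorer = Callable[[Sequence[float]], float]


def fuse_features(content: Sequence[float], scorers: Sequence[Scorer]) -> Vector:
    """Append one score per classifier to the content-based feature vector.

    Assumption: fusion is plain concatenation; the paper may do otherwise.
    """
    scores = [scorer(content) for scorer in scorers]
    return list(content) + scores


def linear_scorer(weights: Sequence[float], bias: float = 0.0) -> Scorer:
    """Build a toy linear classifier that returns a raw decision score."""
    def score(x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return score


# Hypothetical content feature and a hypothetical bank of two classifiers.
content_feature = [0.5, 1.0, 2.0]
classifier_bank = [
    linear_scorer([1.0, 0.0, 0.0]),
    linear_scorer([0.0, 1.0, 1.0], bias=-1.0),
]

# The fused vector holds the 3 content values followed by the 2 scores.
fused = fuse_features(content_feature, classifier_bank)
print(fused)
```

In this sketch the classifier scores act as extra high-level dimensions, so a downstream event classifier would be trained on the fused vector instead of the content features alone.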
