ShotTagger: tag location for internet videos

Social video sharing websites allow users to annotate videos with descriptive keywords called tags, which greatly facilitate video search and browsing. However, many tags describe only part of the video content and carry no temporal indication of when the tagged content actually appears. There is currently very little research on automatically assigning tags to shot-level segments of a video. In this paper, we leverage users' tags as a source for analyzing the content within a video and develop a novel system, named ShotTagger, that assigns tags at the shot level. Tag localization proceeds in two steps. The first estimates the distribution of each tag within the video using a multiple instance learning framework. The second exploits the semantic correlation between a tag and the other tags of the video in an optimization framework and imposes temporal smoothness across adjacent shots to refine the shot-level tagging results. We present several applications to demonstrate the usefulness of the tag localization scheme for searching and browsing videos. A series of experiments on a set of YouTube videos demonstrates the feasibility and effectiveness of our approach.
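
The two-step procedure above is the technical core of the system. The sketch below is a minimal illustration of that idea rather than the authors' actual model: each shot of a tagged video is treated as an instance in a multiple-instance bag and scored against a hypothetical tag prototype, and the raw scores are then refined by blending in a toy semantic-correlation prior and smoothing across adjacent shots. All names, features, and parameters (shot_features, tag_prototype, correlation_boost, alpha) are assumptions introduced purely for illustration.

```python
# Illustrative sketch of shot-level tag localization; hypothetical data and
# scoring, not the formulation used in the paper.
import numpy as np

def mil_shot_scores(shot_features, tag_prototype):
    """Score each shot (instance) of a tagged video (bag) for one tag
    via cosine similarity to an assumed tag prototype vector."""
    sims = shot_features @ tag_prototype
    sims = sims / (np.linalg.norm(shot_features, axis=1)
                   * np.linalg.norm(tag_prototype) + 1e-8)
    return (sims + 1.0) / 2.0  # map cosine similarity into [0, 1]

def refine_with_smoothness(scores, correlation_boost, alpha=0.5, iters=20):
    """Iteratively blend each shot's score with its temporal neighbours
    and with a per-shot boost derived from co-occurring tags."""
    s = scores.copy()
    for _ in range(iters):
        neighbours = np.empty_like(s)
        neighbours[1:-1] = 0.5 * (s[:-2] + s[2:])
        neighbours[0], neighbours[-1] = s[1], s[-2]
        s = (1 - alpha) * (scores + correlation_boost) / 2 + alpha * neighbours
        s = np.clip(s, 0.0, 1.0)
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shots = rng.normal(size=(12, 64))       # 12 shots, 64-d visual features
    tag = shots[4:7].mean(axis=0)           # pretend shots 4-6 match the tag
    raw = mil_shot_scores(shots, tag)
    boost = np.zeros(12); boost[4:7] = 0.3  # toy semantic-correlation prior
    refined = refine_with_smoothness(raw, boost)
    print("tag localized to shots:", np.where(refined > refined.mean())[0])
```

The smoothing step mirrors the intuition stated in the abstract that a tag relevant to one shot is likely relevant to its temporal neighbours; the specific update rule here is only one simple way to encode that intuition.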
