What Is Happening in the Video? —Annotate Video by Sentence

With the popularity of online video-sharing Web sites such as YouTube, millions of users treat online video as a source of information and entertainment, and video annotation has attracted great interest in the past few years. In this paper, we propose a four-step approach to automatically annotate video shots with sentences. The first step is video preprocessing, which converts a video shot into a sequence of frame images. The second step finds candidate elements of a sentence describing the video content. The main elements of the sentence are objects, events, scenes, and modifiers. These candidate elements are obtained by searching our collected image data sets, rather than video data sets, for images similar to the video frames. The third step selects the best elements among the candidates using a weighted scoring algorithm. The final step constructs a sentence, using a correlation graph algorithm to analyze the relationships among the best elements. The experimental results indicate that our method is effective at annotating videos with sentences. Moreover, the proposed weighted scoring algorithm and correlation graph algorithm improve the experimental performance.
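The third step can be illustrated with a minimal sketch. The abstract does not specify the weighted scoring algorithm, so the code below is an assumption rather than the paper's method: each retrieved image votes for its label with a weight equal to its similarity to a video frame, and the highest-scoring label in each element category is kept. The function name `select_best_elements`, the tuple layout, and the sample values are all hypothetical.

```python
from collections import defaultdict


def select_best_elements(candidates):
    """Pick one label per element category by similarity-weighted voting.

    candidates: iterable of (category, label, similarity) tuples, one per
    retrieved image. This scoring rule is an illustrative assumption, not
    the algorithm from the paper.
    """
    # Accumulate the similarity weights of every (category, label) pair.
    scores = defaultdict(float)
    for category, label, similarity in candidates:
        scores[(category, label)] += similarity

    # Keep the highest-scoring label within each category.
    best = {}
    for (category, label), score in scores.items():
        if category not in best or score > best[category][1]:
            best[category] = (label, score)
    return {category: label for category, (label, _) in best.items()}


# Hypothetical candidates gathered from images similar to the video frames.
candidates = [
    ("object", "dog", 0.9),
    ("object", "cat", 0.6),
    ("object", "dog", 0.7),
    ("event", "run", 0.8),
    ("event", "sit", 0.5),
    ("scene", "park", 0.75),
]
best = select_best_elements(candidates)
print(best)
```

In this sketch a label that recurs across several retrieved images accumulates weight, so consistent evidence across frames outweighs a single strong but isolated match.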
