Fine-granularity semantic video annotation: An approach based on automatic shot level concept detection and object recognition

Purpose – Fine-grained video content indexing, retrieval, and adaptation require accurate metadata that describe the video structure and semantics down to the lowest granularity, i.e. the object level. The authors address these requirements by proposing the Semantic Video Content Annotation Tool (SVCAT) for structural and high-level semantic video annotation. SVCAT is a semi-automatic, MPEG-7 standard-compliant annotation tool that produces metadata according to a new object-based video content model introduced in this work. Videos are temporally segmented into shots, and shot-level concepts are detected automatically using ImageNet as background knowledge. These concepts guide the annotator in locating and selecting objects of interest, which are then tracked automatically to generate object-level metadata. Integrating shot-based concept detection with object localization and tracking considerably eases the annotator's task. The paper aims to discuss these issues. Design/methodology/approach...
