A framework for automatic semantic video annotation

The rapidly increasing quantity of publicly available videos has driven research into automatic tools for indexing, rating, searching and retrieval. Textual semantic representations, such as tags, labels and annotations, play an important role in video indexing because they express the semantics needed for search and retrieval in a user-friendly form. Ideally, such annotation should reflect the way humans perceive and describe videos. The difference between the low-level visual content and the corresponding human perception is referred to as the ‘semantic gap’. Bridging this gap is even harder for unconstrained videos, owing to the absence of any prior information about the analysed video on the one hand, and the vast amount of generic knowledge required on the other. This paper introduces a framework for the automatic semantic annotation of unconstrained videos. The proposed framework comprises two domain-independent layers: low-level visual similarity matching, and an annotation analysis that employs commonsense knowledgebases. A commonsense ontology is constructed by integrating multiple structured semantic relationships. Experiments and black-box tests are carried out on standard video databases for action recognition and video information retrieval, while white-box tests examine the performance of the framework's individual intermediate layers. The evaluation of the results and the statistical analysis show that integrating visual similarity matching with commonsense semantic relationships provides an effective approach to automatic video annotation.
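
As a rough illustration of how the two layers could fit together, the sketch below first transfers labels from visually similar corpus videos (layer one), then re-ranks those candidate labels by how strongly they are connected through commonsense relations (layer two). This is a minimal sketch under stated assumptions: the histogram descriptor, the toy `RELATIONS` table and every function name here are illustrative stand-ins for the framework's actual visual features and commonsense knowledgebase, not the paper's implementation.

```python
# Minimal sketch of a two-layer annotation pipeline: visual similarity
# matching followed by commonsense re-ranking. All descriptors, data
# structures and the tiny relation table are illustrative assumptions.

from collections import Counter

import numpy as np


def frame_descriptor(frame: np.ndarray, bins: int = 16) -> np.ndarray:
    """Toy low-level descriptor: a normalized intensity histogram.
    (A real system would use keypoint or motion features instead.)"""
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)


def video_descriptor(frames: list[np.ndarray]) -> np.ndarray:
    """Average the per-frame descriptors into one video-level vector."""
    return np.mean([frame_descriptor(f) for f in frames], axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


# --- Layer 1: low-level visual similarity matching ----------------------
def transfer_annotations(query_frames, corpus, top_k=3) -> Counter:
    """Rank corpus videos by visual similarity to the query and pool
    the annotations of the top-k matches as candidate labels."""
    q = video_descriptor(query_frames)
    ranked = sorted(corpus, key=lambda v: cosine(q, v["desc"]), reverse=True)
    candidates = Counter()
    for video in ranked[:top_k]:
        candidates.update(video["labels"])
    return candidates


# --- Layer 2: commonsense annotation analysis ---------------------------
# Hypothetical stand-in for a commonsense knowledgebase (ConceptNet- or
# WordNet-style relations); the real framework queries a full ontology.
RELATIONS = {
    ("horse", "riding"): "UsedFor",
    ("horse", "outdoor"): "AtLocation",
    ("riding", "outdoor"): "HasContext",
}


def rerank_with_commonsense(candidates: Counter) -> list[tuple[str, float]]:
    """Boost labels that are semantically connected to the other
    candidates: mutually consistent annotations score higher."""
    scored = []
    for label, votes in candidates.items():
        support = sum(
            1
            for other in candidates
            if other != label
            and ((label, other) in RELATIONS or (other, label) in RELATIONS)
        )
        scored.append((label, votes + 0.5 * support))
    return sorted(scored, key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.random((48, 64)) for _ in range(5)]  # fake query video
    corpus = [
        {"desc": video_descriptor([rng.random((48, 64)) for _ in range(5)]),
         "labels": ["horse", "riding", "outdoor"]},
        {"desc": video_descriptor([rng.random((48, 64)) for _ in range(5)]),
         "labels": ["swimming", "indoor"]},
    ]
    print(rerank_with_commonsense(transfer_annotations(frames, corpus)))
```

The design point the sketch tries to capture is that neither layer is domain-specific: the visual layer only needs an annotated corpus to match against, and the commonsense layer only needs generic semantic relationships, so the same pipeline applies to unconstrained videos from any domain.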
