Domain adaptive semantic diffusion for large scale context-based video annotation

Learning to cope with domain change has been known as a challenging problem in many real-world applications. This paper proposes a novel and efficient approach, named domain adaptive semantic diffusion (DASD), to exploit semantic context while considering the domain-shift-of-context for large scale video concept annotation. Starting with a large set of concept detectors, the proposed DASD refines the initial annotation results using graph diffusion technique, which preserves the consistency and smoothness of the annotation over a semantic graph. Different from the existing graph learning methods which capture relations among data samples, the semantic graph treats concepts as nodes and the concept affinities as the weights of edges. Particularly, the DASD approach is capable of simultaneously improving the annotation results and adapting the concept affinities to new test data. The adaptation provides a means to handle domain change between training and test data, which occurs very often in video annotation task. We conduct extensive experiments to improve annotation results of 374 concepts over 340 hours of videos from TRECVID 2005-2007 data sets. Results show consistent and significant performance gain over various baselines. In addition, the proposed approach is very efficient, completing DASD over 374 concepts within just 2 milliseconds for each video shot on a regular PC.

[1]  Jitendra Malik,et al.  Scale-Space and Edge Detection Using Anisotropic Diffusion , 1990, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Lior Wolf,et al.  A Critical View of Context , 2006, International Journal of Computer Vision.

[3]  Antonio Torralba,et al.  Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes , 2003, NIPS.

[4]  Chong-Wah Ngo,et al.  Towards optimal bag-of-features for object categorization and semantic video retrieval , 2007, CIVR '07.

[5]  Antonio Torralba,et al.  Context-based vision system for place and object recognition , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Mubarak Shah,et al.  Improving Semantic Concept Detection and Retrieval using Contextual Estimates , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[7]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[8]  John R. Smith,et al.  Large-scale concept ontology for multimedia , 2006, IEEE MultiMedia.

[9]  Ronald R. Coifman,et al.  Regularization on Graphs with Function-adapted Diffusion Processes , 2008, J. Mach. Learn. Res..

[10]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11]  Serge J. Belongie,et al.  Object categorization using co-occurrence, location and appearance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[13]  Alexander Hauptmann,et al.  How many high-level concepts will fill the semantic gap in video retrieval ? , 2007 .

[14]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[15]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[16]  Yung-Yu Chuang,et al.  Multi-cue fusion for semantic video indexing , 2008, ACM Multimedia.

[17]  Nando de Freitas,et al.  A Statistical Model for General Contextual Object Recognition , 2004, ECCV.

[18]  Shih-Fu Chang,et al.  Graph construction and b-matching for semi-supervised learning , 2009, ICML '09.

[19]  Trevor Darrell,et al.  The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[20]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[21]  Shih-Fu Chang,et al.  Context-Based Concept Fusion with Boosted Conditional Random Fields , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[22]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[23]  Antonio Criminisi,et al.  TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context , 2007, International Journal of Computer Vision.

[24]  Tao Mei,et al.  Correlative multi-label video annotation , 2007, ACM Multimedia.

[25]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[26]  Rong Yan,et al.  How many high-level concepts will fill the semantic gap in news video retrieval? , 2007, CIVR '07.

[27]  Luc Van Gool,et al.  Determination of Optical Flow and its Discontinuities using Non-Linear Diffusion , 1994, ECCV.

[28]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.