Video object graph: A novel semantic-level representation for videos

In this paper, we propose a novel object-based graph framework for video representation. The proposed framework describes a video as a graph in which objects are represented by nodes and the relations between objects are represented by edges. We investigate several spatial and temporal features as node attributes, and several spatio-temporal relationship features between objects as edge attributes. To overcome the influence of camera motion on the detected object motion, a global motion estimation and correction approach is proposed to recover the true object trajectories. We further propose to evaluate the similarity between two videos by establishing object correspondences between their object graphs through graph matching. Results show that our method outperforms other video representation frameworks in matching videos with the same semantic content. The proposed framework provides a compact and robust semantic descriptor for a video, with broad appeal for many video retrieval applications.
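The idea of the representation can be illustrated with a minimal sketch. The following Python code is an assumption-laden toy, not the authors' implementation: node attributes are reduced to an object label plus a camera-motion-corrected centroid trajectory, edge attributes to a single relation value, and the graph matching step is replaced by a simple greedy one-to-one node assignment (the paper uses richer features and a proper graph-matching formulation). All class and function names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str        # object class, e.g. from a detector such as Faster R-CNN
    trajectory: list  # per-frame (x, y) centroids, assumed camera-motion corrected


@dataclass
class VideoObjectGraph:
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)  # (i, j) -> relation attribute

    def add_object(self, label, trajectory):
        self.nodes.append(ObjectNode(label, trajectory))
        return len(self.nodes) - 1

    def add_relation(self, i, j, attr):
        # e.g. mean inter-object distance, relative motion, co-occurrence span
        self.edges[(i, j)] = attr


def node_distance(a, b):
    """Toy node-attribute distance: infinite if labels differ, otherwise the
    mean Euclidean distance between trajectory points over the shared frames."""
    if a.label != b.label:
        return float("inf")
    n = min(len(a.trajectory), len(b.trajectory))
    if n == 0:
        return float("inf")
    total = sum(
        ((a.trajectory[k][0] - b.trajectory[k][0]) ** 2 +
         (a.trajectory[k][1] - b.trajectory[k][1]) ** 2) ** 0.5
        for k in range(n)
    )
    return total / n


def graph_similarity(g1, g2, max_dist=50.0):
    """Greedy one-to-one node matching (a stand-in for graph matching);
    similarity = number of matched node pairs / size of the larger graph."""
    pairs = sorted(
        (node_distance(a, b), i, j)
        for i, a in enumerate(g1.nodes)
        for j, b in enumerate(g2.nodes)
    )
    used1, used2, matched = set(), set(), 0
    for d, i, j in pairs:
        if d <= max_dist and i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            matched += 1
    return matched / max(len(g1.nodes), len(g2.nodes), 1)
```

Two videos of the same event yield graphs with corresponding nodes and consistent trajectories, so the similarity approaches 1; a video with different objects shares no valid correspondences and scores near 0.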
