Combining content and context information for video events classification and retrieval

Content-based video retrieval remains a challenging problem, and its performance depends on the modeling and representation of the video data and on the underlying similarity metric. Most existing metrics evaluate pairwise shot similarity based only on perceptual content, which we denote content-based similarity. In this study, our concern is to recognize and detect video events that are “semantically similar”. We therefore extend content-based similarity to measure the conceptual content of shots, where conceptual content refers to the dynamic semantic concept that reflects a “human action” regardless of its perceptual/visual appearance. In addition, we propose a new similarity metric that exploits shot contexts within a video clip collection. The context of a shot is a vector in which each dimension represents the content similarity between that shot and another shot in the collection. The context similarity between two videos is then obtained by comparing the corresponding context vectors using standard vector similarity functions. Furthermore, linear and non-linear fusion schemes are introduced to weight the relative contribution of each similarity in the overall retrieval and classification process. Experimental results demonstrate that using context similarity can significantly improve retrieval performance.
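The construction described above can be sketched as follows. This is a minimal, illustrative implementation, not the authors' exact method: the content-similarity matrix `sim` is an assumed toy example, cosine similarity stands in for the unspecified vector similarity function, and `alpha` is a hypothetical weight for the linear fusion scheme.

```python
import math

# Assumed, illustrative content-similarity matrix for a collection of 4 shots:
# sim[i][j] is the pairwise content similarity between shots i and j.
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]

def context_vector(i, sim):
    """Context of shot i: its content similarity to every shot in the collection."""
    return sim[i]

def cosine(u, v):
    """One possible vector similarity function for comparing context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fused_similarity(i, j, sim, alpha=0.5):
    """Linear fusion of content and context similarity; alpha is a tuning weight."""
    content = sim[i][j]
    context = cosine(context_vector(i, sim), context_vector(j, sim))
    return alpha * content + (1 - alpha) * context

# Shots 0 and 1 are similar both directly (content) and through their
# relations to the rest of the collection (context).
print(round(fused_similarity(0, 1, sim), 3))
```

In practice the fusion weight would be learned or tuned on a validation set, and the non-linear scheme mentioned in the abstract would replace the weighted sum with a trained combiner.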
