A spatiotemporal decomposition strategy for personal home video management

With the advent and proliferation of low-cost, high-performance digital video recording devices, an increasing number of personal home video clips are recorded and stored by consumers. Compared to image data, video data is larger in size and richer in multimedia content, so efficient access to video content is expected to be more challenging than image mining. Previously, we developed a content-based image retrieval system and a benchmarking framework for personal images. In this paper, we extend our personal image retrieval system to include personal home video clips. A possible initial solution to video mining is to represent each video clip by a set of key frames extracted from it, thus converting the problem into an image search problem. We report that a careful selection of key frames can improve retrieval accuracy. However, because video also has a temporal dimension, its key frame representation is inherently limited. Temporal information enables a better representation of video content at the semantic object and concept levels than image-only representations. We propose a bottom-up framework that combines interest point tracking, image segmentation, and motion-shape factorization to decompose a video into spatiotemporal regions. We show an example application of activity concept detection using the trajectories extracted from these spatiotemporal regions. The proposed approach shows good potential for concise representation and indexing of objects and their motion in real-life consumer video.
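As an illustration of the first stage of such a pipeline, the sketch below shows one plausible way to extract interest point trajectories from a clip using OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade optical flow. This is an assumption for illustration only, not the paper's implementation; the function name `extract_trajectories` is hypothetical, and the segmentation and motion-shape factorization stages are omitted.

```python
# Sketch: per-point trajectories from a video clip via corner detection
# and Lucas-Kanade tracking. Assumes OpenCV (cv2) is installed.
import cv2

def extract_trajectories(video_path, max_corners=200):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    # One trajectory (list of (x, y) positions) per tracked interest point.
    trajectories = [[tuple(p.ravel())] for p in pts]

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        for traj, p, s in zip(trajectories, new_pts, status):
            if s[0] == 1:  # point was successfully tracked in this frame
                traj.append(tuple(p.ravel()))
        pts, prev_gray = new_pts, gray

    cap.release()
    return trajectories
```

Trajectories of this kind could then be grouped with the segmented regions and factorized by motion and shape, as described above, to yield the spatiotemporal decomposition.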
