Video retrieval using spatio-temporal descriptors

This paper describes a novel methodology for implementing video search functions such as retrieval of near-duplicate videos and recognition of actions in surveillance video. Videos are divided into half-second clips whose stacked frames produce 3D space-time volumes of pixels. Pixel regions with consistent color and motion properties are extracted from these 3D volumes by a threshold-free hierarchical space-time segmentation technique. Each region is then described by a high-dimensional point whose components represent the position, motion and, when possible, color of the region. In the indexing phase for a video database, these points are assigned labels that specify their video clip of origin. All the labeled points for all the clips are stored in a single binary tree for efficient $k$-nearest-neighbor retrieval. The retrieval phase uses video segments as queries. Half-second clips of these queries are again segmented to produce sets of points, and for each point the labels of its nearest neighbors are retrieved. The labels that receive the largest numbers of votes correspond to the database clips that are most similar to the query video segment. We illustrate this approach for video indexing and retrieval and for action recognition. First, we describe retrieval experiments for dynamic logos, and for video queries that differ from the indexed broadcasts by the addition of large overlays. Then we describe experiments in which office actions (such as pulling and closing drawers, taking and storing items, picking up and putting down a phone) are recognized. In these action recognition experiments, color information is ignored to ensure independence from people's appearance. A distinct advantage of this approach for action recognition is that it requires no detection or recognition of body parts.
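The indexing and voting steps described above can be illustrated with a minimal sketch. It assumes each region has already been reduced to a descriptor tuple, here a hypothetical 4-component `(x, y, vx, vy)` with made-up values, and it replaces the binary search tree with a brute-force scan, which returns the same neighbors but is not efficient. The function names `build_index`, `k_nearest_labels`, and `rank_clips` are illustrative and do not come from the paper.

```python
from collections import Counter
import heapq
import math


def build_index(clips):
    """Flatten {clip_label: [descriptor, ...]} into (descriptor, label) pairs."""
    return [(tuple(p), label) for label, points in clips.items() for p in points]


def k_nearest_labels(index, query_point, k):
    """Labels of the k indexed points closest to query_point (Euclidean distance).

    Brute-force scan; an assumption standing in for the tree search in the text.
    """
    nearest = heapq.nsmallest(
        k, index, key=lambda entry: math.dist(entry[0], query_point)
    )
    return [label for _, label in nearest]


def rank_clips(index, query_points, k=3):
    """Each query point casts one vote per nearest neighbor for that neighbor's clip."""
    votes = Counter()
    for q in query_points:
        votes.update(k_nearest_labels(index, q, k))
    return votes.most_common()


# Toy database: two clips, three region descriptors each (illustrative values only).
database = {
    "clip_A": [(0.1, 0.2, 1.0, 0.0), (0.3, 0.2, 1.0, 0.1), (0.5, 0.6, 0.9, 0.0)],
    "clip_B": [(0.8, 0.9, -1.0, 0.0), (0.7, 0.8, -0.9, 0.1), (0.9, 0.7, -1.0, -0.1)],
}
index = build_index(database)

# Query clip: two regions moving in the same direction as those of clip_A.
query = [(0.12, 0.22, 1.0, 0.0), (0.48, 0.58, 0.9, 0.05)]
ranking = rank_clips(index, query, k=2)
print(ranking)
```

With these toy values, every neighbor of both query points belongs to `clip_A`, so it collects all four votes. For the color-free action recognition setting, the same code applies unchanged as long as the descriptors simply omit the color components.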
