Volumetric Features for Video Event Detection

Real-world actions occur often in crowded, dynamic environments. This poses a difficult challenge for current approaches to video event detection because it is difficult to segment the actor from the background due to distracting motion from other objects in the scene. We propose a technique for event recognition in crowded videos that reliably identifies actions in the presence of partial occlusion and background clutter. Our approach is based on three key ideas: (1) we efficiently match the volumetric representation of an event against oversegmented spatio-temporal video volumes; (2) we augment our shape-based features using flow; (3) rather than treating an event template as an atomic entity, we separately match by parts (both in space and time), enabling robustness against occlusions and actor variability. Our experiments on human actions, such as picking up a dropped object or waving in a crowd show reliable detection with few false positives.

[1]  Timothy F. Cootes,et al.  Active Shape Models - 'smart snakes' , 1992, BMVC.

[2]  Martial Hebert,et al.  Spatio-temporal Shape and Flow Correlation for Action Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Edouard Thiel,et al.  Optimizing 3D chamfer masks with norm constraints , 2000 .

[4]  Jitendra Malik,et al.  Motion segmentation and tracking using normalized cuts , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[5]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[6]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Paul A. Viola,et al.  Face Recognition Using Boosted Local Features , 2003 .

[8]  B. Kimia,et al.  3D object recognition using shape similiarity-based aspect graph , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[9]  Martial Hebert,et al.  Semi-supervised training of models for appearance-based statistical object detection methods , 2004 .

[10]  Ilan Shimshoni,et al.  Mean shift based clustering in high dimensions: a texture classification example , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[11]  Maja J. Mataric,et al.  Deriving action and behavior primitives from human motion data , 2002, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[12]  Eli Shechtman,et al.  Space-time behavior based correlation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[13]  Jianbo Shi,et al.  Bottom-up Recognition and Parsing of the Human Body , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  R. Basri,et al.  Shape representation and classification using the Poisson equation , 2004, CVPR 2004.

[15]  Takeo Kanade,et al.  Mode-seeking by Medoidshifts , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16]  Marcus Jerome Pickering,et al.  Video Retrieval by Feature Learning in Key Frames , 2002, CIVR.

[17]  Peter Meer,et al.  Synergism in low level vision , 2002, Object recognition supported by user interaction for service robots.

[18]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[19]  Edouard Thiel,et al.  Computing 3D Medial Axis for Chamfer Distances , 2000, DGCI.

[20]  MedioniGérard,et al.  Event Detection and Analysis from Video Streams , 2001 .

[21]  Amit Bandapadhay,et al.  Searching parameter spaces with noisy linear constraints , 1988, Proceedings CVPR '88: The Computer Society Conference on Computer Vision and Pattern Recognition.

[22]  Rama Chellappa,et al.  Activity recognition using the dynamics of the configuration of interacting objects , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[23]  Mubarak Shah,et al.  Actions sketch: a novel action representation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[24]  Jan P. Allebach,et al.  Multiscale branch-and-bound image database search , 1997, Electronic Imaging.

[25]  Larry D. Hostetler,et al.  The estimation of the gradient of a density function, with applications in pattern recognition , 1975, IEEE Trans. Inf. Theory.

[26]  Tomaso A. Poggio,et al.  Example-Based Learning for View-Based Human Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Edward H. Adelson,et al.  Representing moving images with layers , 1994, IEEE Trans. Image Process..

[28]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[29]  Antonio Torralba,et al.  Sharing features: efficient boosting procedures for multiclass object detection , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[30]  Eli Shechtman,et al.  Space-Time Behavior-Based Correlation-OR-How to Tell If Two Underlying Motion Fields Are Similar Without Computing Them? , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[32]  Osama Masoud,et al.  A method for human action recognition , 2003, Image Vis. Comput..

[33]  Serge Beucher,et al.  THE WATERSHED TRANSFORMATION APPLIED TO IMAGE SEGMENTATION , 2009 .

[34]  M. Turk,et al.  Eigenfaces for Recognition , 1991, Journal of Cognitive Neuroscience.

[35]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[36]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[37]  Thomas Serre,et al.  Component-based face detection , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[38]  Jitendra Malik,et al.  A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[39]  Timothy F. Cootes,et al.  Active Appearance Models , 1998, ECCV.

[40]  Daniel Cremers,et al.  Variational space-time motion segmentation , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[41]  John Hart,et al.  ACM Transactions on Graphics: Editorial , 2003, SIGGRAPH 2003.

[42]  Sang Uk Lee,et al.  Efficient video indexing scheme for content-based retrieval , 1999, IEEE Trans. Circuits Syst. Video Technol..

[43]  M. Shah,et al.  Actions As Objects : A Novel Action Representation , 2005 .

[44]  Timothy F. Cootes,et al.  Active Shape Models-Their Training and Application , 1995, Comput. Vis. Image Underst..

[45]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[46]  Takeo Kanade,et al.  Learning GMRF Structures for Spatial Priors , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Haibin Ling,et al.  Shape Classification Using the Inner-Distance , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Dorin Comaniciu,et al.  An Algorithm for Data-Driven Bandwidth Selection , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[49]  Robert C. Bolles,et al.  Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching , 1977, IJCAI.

[50]  David S. Doermann,et al.  Video retrieval of near-duplicates using κ-nearest neighbor retrieval of spatio-temporal descriptors , 2006, Multimedia Tools and Applications.

[51]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[52]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[53]  Mubarak Shah,et al.  Motion layer extraction in the presence of occlusion using graph cuts , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Ze-Nian Li,et al.  Successive Convex Matching for Action Detection , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[55]  Jake K. Aggarwal,et al.  Human Motion Analysis: A Review , 1999, Comput. Vis. Image Underst..

[56]  Ashok Veeraraghavan,et al.  The Function Space of an Activity , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[57]  Randal C. Nelson,et al.  Detection and Recognition of Periodic, Nonrigid Motion , 1997, International Journal of Computer Vision.

[58]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[59]  David A. Forsyth,et al.  Automatic Annotation of Everyday Movements , 2003, NIPS.

[60]  Peter Meer,et al.  Edge Detection with Embedded Confidence , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[61]  Ruggero Frezza,et al.  Nonlinear control by a finite set of motion primitives , 2004 .

[62]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[63]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[64]  Robert T. Collins,et al.  Mean-shift blob tracking through scale space , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[65]  Tsuhan Chen,et al.  Hierarchical matching for retrieval of hand-drawn sketches , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[66]  Wen Gao,et al.  Action Recognition in Broadcast Tennis Video , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[67]  Irfan A. Essa,et al.  Exploiting human actions and object context for recognition tasks , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[68]  John F. Canny,et al.  A Computational Approach to Edge Detection , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  John R. Smith,et al.  Learning and Classification of Semantic Concepts in Broadcast Video , .

[70]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[71]  Ramakant Nevatia,et al.  Multi-agent event recognition , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[72]  Jitendra Malik,et al.  Shape Guided Object Segmentation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[73]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[74]  Dariu Gavrila,et al.  Multi-feature hierarchical template matching using distance transforms , 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[75]  Greg Mori,et al.  Guiding model search using segmentation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[76]  James M. Rehg,et al.  Learning a Rare Event Detection Cascade by Direct Feature Selection , 2003, NIPS.

[77]  Guillermo Sapiro,et al.  A Geodesic Framework for Fast Interactive Image and Video Segmentation and Matting , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[78]  Jianbo Shi,et al.  Recognizing objects by piecing together the Segmentation Puzzle , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[79]  Pietro Perona,et al.  Decomposition of human motion into dynamics-based primitives with application to drawing tasks , 2003, Autom..

[80]  Dariu Gavrila,et al.  Pedestrian Detection from a Moving Vehicle , 2000, ECCV.

[81]  Miguel Á. Carreira-Perpiñán Acceleration Strategies for Gaussian Mean-Shift Image Segmentation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[82]  Jean-Marc Odobez,et al.  Robust Multiresolution Estimation of Parametric Motion Models , 1995, J. Vis. Commun. Image Represent..

[83]  Irfan A. Essa,et al.  Unsupervised Activity Discovery and Characterization From Event-Streams , 2005, UAI.

[84]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[85]  David A. Forsyth,et al.  Building models of animals from video , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[86]  Michal Irani,et al.  Similarity by Composition , 2006, NIPS.

[87]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[88]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[89]  Pietro Perona,et al.  One-shot learning of object categories , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[90]  Jitendra Malik,et al.  Learning to detect natural image boundaries using local brightness, color, and texture cues , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[91]  Hans-Peter Kriegel,et al.  3D Shape Histograms for Similarity Search and Classification in Spatial Databases , 1999, SSD.

[92]  Mubarak Shah,et al.  Object based segmentation of video using color, motion and spatial information , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[93]  Vladimir Kolmogorov,et al.  An experimental comparison of min-cut/max- flow algorithms for energy minimization in vision , 2001, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[94]  Pietro Perona,et al.  Unsupervised Learning of Models for Recognition , 2000, ECCV.

[95]  Bo Thiesson,et al.  Image and Video Segmentation by Anisotropic Kernel Mean Shift , 2004, ECCV.

[96]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[97]  Ellen M. Voorhees,et al.  Evaluating Evaluation Measure Stability , 2000, SIGIR 2000.

[98]  Tieniu Tan,et al.  Recent developments in human motion analysis , 2003, Pattern Recognit..

[99]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[100]  Szymon Rusinkiewicz,et al.  Rotation Invariant Spherical Harmonic Representation of 3D Shape Descriptors , 2003, Symposium on Geometry Processing.

[101]  Anil C. Kokaram,et al.  Semantic Event Detection in Sports Through Motion Understanding , 2004, CIVR.

[102]  Jitendra Malik,et al.  Learning a classification model for segmentation , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[103]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[104]  Yaser Sheikh,et al.  Exploring the space of a human action , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[105]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[106]  Paul A. Viola,et al.  Detecting Pedestrians Using Patterns of Motion and Appearance , 2005, International Journal of Computer Vision.

[107]  Eli Shechtman,et al.  Matching Local Self-Similarities across Images and Videos , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[108]  Yee Leung,et al.  Clustering by Scale-Space Filtering , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[109]  Wen Gao,et al.  Action Recognition in Broadcast Tennis Video Using Optical Flow and Support Vector Machine , 2006, ECCV Workshop on HCI.

[110]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[111]  Gunilla Borgefors,et al.  Hierarchical Chamfer Matching: A Parametric Edge Matching Algorithm , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[112]  Patrick Bouthemy,et al.  Probabilistic parameter-free motion detection , 2004, CVPR 2004.

[113]  Martial Hebert,et al.  Using Spatio-Temporal Patches for Simultaneous Estimation of Edge Strength, Orientation, and Motion , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[114]  Yizong Cheng,et al.  Mean Shift, Mode Seeking, and Clustering , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[115]  Zhuowen Tu,et al.  Supervised Learning of Edges and Object Boundaries , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[116]  Jean-Marc Odobez,et al.  Video text recognition using sequential Monte Carlo and error voting methods , 2005, Pattern Recognit. Lett..

[117]  Pablo O. Arambel,et al.  Multiple-hypothesis tracking of multiple ground targets from aerial video with dynamic sensor control , 2004, SPIE Defense + Commercial Sensing.

[118]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[119]  A. Fathi,et al.  Human Pose Estimation using Motion Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[120]  Ramakant Nevatia,et al.  Event Detection and Analysis from Video Streams , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[121]  Clark F. Olson,et al.  Automatic target recognition by matching oriented edge pixels , 1997, IEEE Trans. Image Process..

[122]  Irfan A. Essa,et al.  Recognizing multitasked activities from video using stochastic context-free grammar , 2002, AAAI/IAAI.

[123]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[124]  Marc M. Van Hulle,et al.  A phase-based approach to the estimation of the optical flow field using spatial filtering , 2002, IEEE Trans. Neural Networks.

[125]  Bernt Schiele,et al.  Pedestrian detection in crowded scenes , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[126]  Maneesh Agrawala,et al.  Interactive video cutout , 2005, SIGGRAPH 2005.

[127]  Bernt Schiele,et al.  An Evaluation of Local Shape-Based Features for Pedestrian Detection , 2005, BMVC.

[128]  Rémi Ronfard,et al.  Automatic Discovery of Action Taxonomies from Multiple Views , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[129]  Larry S. Davis,et al.  Mean-shift analysis using quasiNewton methods , 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[130]  Luc Van Gool,et al.  Coupled Detection and Trajectory Estimation for Multi-Object Tracking , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[131]  Shimon Ullman,et al.  Semantic Hierarchies for Recognizing Objects and Parts , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[132]  Thomas M. Breuel A Comparison of Search Strategies for Geometric Branch and Bound Algorithms , 2002, ECCV.

[133]  Joshua D. Schwartz,et al.  Hierarchical Matching of Deformable Shapes , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[134]  Björn Stenger,et al.  Shape context and chamfer matching in cluttered scenes , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[135]  David P. Dobkin,et al.  A search engine for 3D models , 2003, TOGS.

[136]  Emine Yilmaz,et al.  A geometric interpretation of r-precision and its correlation with average precision , 2005, SIGIR '05.

[137]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[138]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[139]  Chitra Dorai,et al.  Automatic text extraction from video for content-based annotation and retrieval , 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[140]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[141]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[142]  Joonki Paik,et al.  Color active shape models for tracking non-rigid objects , 2003, Pattern Recognit. Lett..

[143]  David A. Forsyth,et al.  Tracking People by Learning Their Appearance , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[144]  Dragutin Petkovic,et al.  Query by Image and Video Content: The QBIC System , 1995, Computer.

[145]  Shimon Ullman,et al.  Uncovering shared structures in multiclass classification , 2007, ICML '07.

[146]  William Rucklidge,et al.  Efficient Visual Recognition Using the Hausdorff Distance , 1996, Lecture Notes in Computer Science.

[147]  Antonio Torralba,et al.  Sharing Visual Features for Multiclass and Multiview Object Detection , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[148]  Daniel DeMenthon,et al.  SPATIO-TEMPORAL SEGMENTATION OF VIDEO BY HIERARCHICAL MEAN SHIFT ANALYSIS , 2002 .

[149]  Martin A. Fischler,et al.  The Representation and Matching of Pictorial Structures , 1973, IEEE Transactions on Computers.

[150]  Shimon Ullman,et al.  Combining Class-Specific Fragments for Object Classification , 1999, BMVC.