Efficient Search and Localization of Human Actions in Video Databases

As digital video databases grow, so does the problem of navigating them effectively. In this paper we propose a novel content-based video retrieval approach for searching such databases, specifically those involving human actions, that incorporates spatio-temporal localization. We outline a novel, highly efficient localization model that first performs temporal localization based on histograms of evenly spaced time-slices, then spatial localization based on histograms of a 2-D spatial grid. We further argue that our retrieval model, which applies this localization followed by relevance ranking, results in a highly discriminative system while remaining an order of magnitude faster than the current state-of-the-art method. We also show how relevance feedback can be applied to our localization and ranking algorithms. As a result, the presented system is more directly applicable to real-world problems than any prior content-based video retrieval system.
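The two-stage localization described above can be illustrated with a minimal sketch. The abstract does not specify the feature type, the scoring function, or how the best window is selected, so everything below is an assumption made for illustration: features are taken to be quantized visual words with (x, y, t) positions, slices and grid cells are scored by histogram intersection against a query histogram minus a hypothetical `bias`, the temporal window is the maximum-sum contiguous run of slices, and the spatial box is the bounding box of positively scored cells. The function names `intersection`, `best_run`, and `localize` are hypothetical.

```python
from collections import Counter


def intersection(hist, query_hist):
    """Histogram intersection over the words in the query (assumed score)."""
    return sum(min(hist[w], query_hist[w]) for w in query_hist)


def best_run(scores):
    """Maximum-sum contiguous run; returns (start, end) inclusive, or None."""
    best, best_span = 0.0, None
    cur, start = 0.0, 0
    for i, s in enumerate(scores):
        if cur <= 0:
            cur, start = s, i      # restart the run at this slice
        else:
            cur += s               # extend the current run
        if cur > best:
            best, best_span = cur, (start, i)
    return best_span


def localize(features, query_hist, n_frames, width, height,
             n_slices=10, grid=(4, 4), bias=1.0):
    """features: iterable of (word, x, y, t). Returns ((t0, t1), box) or None."""
    # Stage 1: temporal localization over evenly spaced time-slices.
    slice_len = n_frames / n_slices
    slice_hists = [Counter() for _ in range(n_slices)]
    for word, x, y, t in features:
        slice_hists[min(int(t / slice_len), n_slices - 1)][word] += 1
    t_scores = [intersection(h, query_hist) - bias for h in slice_hists]
    span = best_run(t_scores)
    if span is None:
        return None
    t0, t1 = span[0] * slice_len, (span[1] + 1) * slice_len

    # Stage 2: spatial localization over a 2-D grid, inside the time window.
    gx, gy = grid
    cell_hists = {}
    for word, x, y, t in features:
        if t0 <= t < t1:
            cell = (min(int(x * gx / width), gx - 1),
                    min(int(y * gy / height), gy - 1))
            cell_hists.setdefault(cell, Counter())[word] += 1
    hits = [c for c, h in cell_hists.items()
            if intersection(h, query_hist) - bias > 0]
    if not hits:
        return None
    cw, ch = width / gx, height / gy
    xs, ys = [c[0] for c in hits], [c[1] for c in hits]
    box = (min(xs) * cw, min(ys) * ch, (max(xs) + 1) * cw, (max(ys) + 1) * ch)
    return (t0, t1), box


# Toy usage: an "action" made of words 1 and 2 in frames 30-49,
# plus unrelated background features (word 9) spread over the whole clip.
query = Counter({1: 3, 2: 3})
action = [(w, x, y, t) for t in range(30, 50, 2)
          for (w, x, y) in ((1, 12, 22), (2, 15, 25))]
background = [(9, 35, 5, t) for t in range(0, 100, 5)]
result = localize(action + background, query, n_frames=100, width=40, height=40)
print(result)
```

Because the temporal pass works on one-dimensional slice histograms and the spatial pass only looks at features inside the selected window, each stage is a single linear scan over the features. This is one plausible reading of why a coarse-to-fine scheme of this kind can be cheap; it is not a claim about the authors' actual implementation.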
