Actor-independent action search using spatiotemporal vocabulary with appearance hashing

Human actions in movies and sitcoms often carry semantic cues for story understanding, offering a search pattern beyond the traditional video search scenario. However, action-level video search faces great challenges, such as global camera motion, concurrent actions, and variations in actor appearance. In this paper, we introduce a generalized action retrieval framework that achieves fully unsupervised, robust, and actor-independent action search in a large-scale database. First, an Attention Shift model is presented to extract human-focused foreground actions from videos containing global motion or concurrent actions. Subsequently, a spatiotemporal vocabulary is built from 3D-SIFT features extracted within these human-focused action regions; the 3D-SIFT features provide robustness against rotation and viewpoint changes, while the spatiotemporal vocabulary ensures search efficiency through an inverted indexing structure with approximate nearest-neighbor search. For online ranking, we employ the dynamic time warping (DTW) distance to handle variations in action duration as well as partial action matching. Finally, an appearance hashing strategy is presented to address the performance degradation caused by divergent actor appearances. For experimental validation, we deploy the actor-independent action retrieval framework on three seasons of the "Friends" sitcom (over 30 hours). On this database, we report the best performance (MAP@1 > 0.53) in comparison with alternative and state-of-the-art approaches.
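To make the online ranking step concrete, below is a minimal sketch of a DTW distance between two frame-level feature sequences (for example, per-frame visual-word histograms). The function name, the per-frame L2 distance, and the histogram representation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def dtw_distance(query, candidate, frame_dist=None):
    """Dynamic time warping distance between two frame-level feature sequences.

    query, candidate: arrays of shape (n_frames, dim), e.g. per-frame
    visual-word histograms (an illustrative choice, not necessarily the
    paper's exact features).
    """
    if frame_dist is None:
        # L2 distance between frame descriptors (an assumed default)
        frame_dist = lambda a, b: np.linalg.norm(a - b)

    n, m = len(query), len(candidate)
    # cost[i, j] = minimal accumulated cost of aligning query[:i] with candidate[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(query[i - 1], candidate[j - 1])
            # classic DTW recursion: stretch query, stretch candidate, or match one-to-one
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]
```

Candidate clips retrieved from the inverted index can then be ranked by ascending DTW distance to the query sequence; this tolerates differences in action duration between query and candidate.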
