Transform based spatio-temporal descriptors for human action recognition

Classic transform methods are widely and effectively used in image processing tasks such as image denoising, image segmentation, feature detection, and compression. Exploiting their ability to represent signals and images compactly, we apply transform-based techniques to video recognition: we extract discriminative information from each video sequence and use the transformed coefficients as descriptors for representing and recognizing human actions. We validate the proposed methods on the KTH and Hollywood datasets, both of which have been studied extensively. The proposed descriptors, especially the wavelet-transform-based descriptor, yield promising results on action recognition.
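To make the idea of using transformed coefficients as a descriptor concrete, the following is a minimal sketch, not the method from the paper. It assumes a small grayscale spatio-temporal cuboid of shape (frames, height, width) with even dimensions, applies a single-level orthonormal 3D Haar wavelet transform, and keeps the low-pass (approximation) subband as the feature vector. The function names `haar_step`, `haar3d`, and `haar3d_descriptor`, the cuboid size, and the choice of subband are all illustrative assumptions.

```python
import numpy as np


def haar_step(x, axis):
    """One level of the orthonormal Haar transform along one axis.

    Assumes the length of `axis` is even. The low-pass half is stored
    first, followed by the high-pass half.
    """
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    out = np.concatenate([low, high], axis=0)
    return np.moveaxis(out, 0, axis)


def haar3d(cuboid):
    """Single-level separable 3D Haar transform of a (T, H, W) cuboid."""
    coeffs = np.asarray(cuboid, dtype=np.float64)
    for axis in range(3):
        coeffs = haar_step(coeffs, axis)
    return coeffs


def haar3d_descriptor(cuboid):
    """Hypothetical descriptor: the flattened approximation subband."""
    t, h, w = cuboid.shape
    coeffs = haar3d(cuboid)
    return coeffs[: t // 2, : h // 2, : w // 2].ravel()


# Synthetic stand-in for a cuboid cut from a video (assumed size 8x16x16).
rng = np.random.default_rng(0)
cuboid = rng.random((8, 16, 16))
descriptor = haar3d_descriptor(cuboid)
print(descriptor.shape)
```

In a full pipeline, descriptors like this one would typically be computed for many cuboids per video and then passed to a classifier; those stages are outside the scope of this sketch.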
