Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach

Extracting discriminative and robust features from video sequences is the first and most critical step in human action recognition. In this paper, instead of using handcrafted features, we automatically learn spatio-temporal motion features for action recognition. This is achieved via an evolutionary method, i.e., genetic programming (GP), which evolves the motion feature descriptor on a population of primitive 3D operators (e.g., 3D-Gabor and wavelet). In this way, the scale and shift invariant features can be effectively extracted from both color and optical flow sequences. We intend to learn data adaptive descriptors for different datasets with multiple layers, which makes fully use of the knowledge to mimic the physical structure of the human visual cortex for action recognition and simultaneously reduce the GP searching space to effectively accelerate the convergence of optimal solutions. In our evolutionary architecture, the average cross-validation classification error, which is calculated by an support-vector-machine classifier on the training set, is adopted as the evaluation criterion for the GP fitness function. After the entire evolution procedure finishes, the best-so-far solution selected by GP is regarded as the (near-)optimal action descriptor obtained. The GP-evolving feature extraction method is evaluated on four popular action datasets, namely KTH, HMDB51, UCF YouTube, and Hollywood2. Experimental results show that our method significantly outperforms other types of features, either hand-designed or machine-learned.

[1]  Dacheng Tao,et al.  Slow Feature Analysis for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  E. Meijering,et al.  A chronology of interpolation: from ancient astronomy to modern signal and image processing , 2002, Proc. IEEE.

[3]  Sebastian Nowozin,et al.  Discriminative Subsequence Mining for Action Classification , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[4]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[5]  Andries P. Engelbrecht,et al.  Dynamic Clustering using Particle Swarm Optimization with Application in Unsupervised Image Classification , 2007 .

[6]  Laurent Itti,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Rapid Biologically-inspired Scene Classification Using Features Shared with Visual Attention , 2022 .

[7]  Lei Guo,et al.  An Object-Oriented Visual Saliency Detection Framework Based on Sparse Coding Representations , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[8]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Edward A. Fox,et al.  A genetic programming framework for content-based image retrieval , 2009, Pattern Recognit..

[10]  Sean Luke,et al.  Lexicographic Parsimony Pressure , 2002, GECCO.

[11]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[12]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  I. Daubechies,et al.  Biorthogonal bases of compactly supported wavelets , 1992 .

[14]  Lalit M. Patnaik,et al.  Application of genetic programming for multicategory pattern classification , 2000, IEEE Trans. Evol. Comput..

[15]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[16]  Mengjie Zhang,et al.  Multiclass Object Classification Using Genetic Programming , 2004, EvoWorkshops.

[17]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[18]  Riccardo Poli,et al.  A Field Guide to Genetic Programming , 2008 .

[19]  R. Poli Genetic programming for image analysis , 1996 .

[20]  Ling Shao,et al.  Genetic Programming-Evolved Spatio-Temporal Descriptor for Human Action Recognition , 2012, BMVC.

[21]  Xuelong Li,et al.  General Tensor Discriminant Analysis and Gabor Features for Gait Recognition , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  Rahul Sukthankar,et al.  Exploiting multi-level parallelism for low-latency activity recognition in streaming video , 2010, MMSys '10.

[24]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[25]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[26]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[27]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[28]  E. Meijering A chronology of interpolation: from ancient astronomy to modern signal and image processing , 2002, Proc. IEEE.

[29]  Andreas Zell,et al.  Evolving Task Specific Image Operator , 1999, EvoWorkshops.

[30]  Bir Bhanu,et al.  Adaptive image segmentation using a genetic algorithm , 1989, IEEE Transactions on Systems, Man, and Cybernetics.

[31]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[33]  Ling Shao,et al.  Spatio-Temporal Laplacian Pyramid Coding for Action Recognition , 2014, IEEE Transactions on Cybernetics.

[34]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[35]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[36]  Julie Wilson,et al.  Novel feature selection method for genetic programming using metabolomic 1H NMR data , 2006 .

[37]  Ling Shao,et al.  Learning Discriminative Key Poses for Action Recognition , 2013, IEEE Transactions on Cybernetics.

[38]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[39]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2008, International Journal of Computer Vision.

[40]  Vito Di Gesù,et al.  A Memetic Algorithm for Binary Image Reconstruction , 2008, IWCIA.

[41]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  R. Keys Cubic convolution interpolation for digital image processing , 1981 .

[43]  Ling Shao,et al.  Embedding Motion and Structure Features for Action Recognition , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[44]  Banani Saha,et al.  Optimal face recognition method using ant colony based Back Propagation network , 2009, 2009 4th International Conference on Computers and Devices for Communication (CODEC).

[45]  Ling Shao,et al.  Realistic action recognition via sparsely-constructed Gaussian processes , 2014, Pattern Recognit..

[46]  Greg Mori,et al.  Action recognition by learning mid-level motion features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[48]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Feng Wu,et al.  Background Prior-Based Salient Object Detection via Deep Reconstruction Residual , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[51]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[52]  Daniel Howard,et al.  Target detection in SAR imagery by genetic programming , 1999 .

[53]  Leonardo Trujillo,et al.  Synthesis of interest point detectors through genetic programming , 2006, GECCO.

[54]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[55]  Mengjie Zhang,et al.  Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach , 2013, IEEE Transactions on Cybernetics.

[56]  James W. Davis Hierarchical motion history images for recognizing human motion , 2001, Proceedings IEEE Workshop on Detection and Recognition of Events in Video.

[57]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[58]  Vidroha Debroy,et al.  Genetic Programming , 1998, Lecture Notes in Computer Science.

[59]  T. Poggio,et al.  Hierarchical models of object recognition in cortex , 1999, Nature Neuroscience.

[60]  Marc'Aurelio Ranzato,et al.  Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[62]  Xiang Ji,et al.  Representing and Retrieving Video Shots in Human-Centric Brain Imaging Space , 2013, IEEE Transactions on Image Processing.

[63]  Mark Johnston,et al.  Particle Swarm Optimization based Adaboost for face detection , 2009, 2009 IEEE Congress on Evolutionary Computation.

[64]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[65]  Asoke K. Nandi,et al.  Feature generation using genetic programming with application to fault classification , 2005, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).