InfAR dataset: Infrared action recognition at different times

Abstract Action recognition (AR) is one of the most important tasks in video analysis and computer vision, and a large number of related methods have been proposed recently. Most of these methods are evaluated on AR datasets collected in the visible spectrum, while the AR problem in infrared scenarios has attracted little attention. There are even few public infrared datasets available to support the basic evaluation needs of this research. To address this issue, this work aims to emphasize the importance of the infrared AR problem in applications and to draw researchers' attention to this task. Specifically, we construct a new Infrared Action Recognition (InfAR) dataset captured at different times, including summer and winter, and explore how discriminable the actions in our InfAR dataset are using state-of-the-art pipelines based on low-level features and on deep convolutional neural networks (CNNs), respectively. Our results reveal that: (1) overall, the dense trajectory feature achieves the best performance, while appearance features, e.g., HOG, perform relatively poorly; (2) encoding with the vector of locally aggregated descriptors (VLAD) is evidently better than encoding with the widely used Fisher Vector; (3) late fusion yields better performance than early fusion; (4) action videos captured in winter are more discriminable than those captured in summer; (5) compared to appearance information, motion information is more essential for infrared action recognition, and exploiting this information through a deep CNN can greatly improve performance. The best performance achieved on our dataset is 76.66% (Average Precision), which leaves reasonable room for further exploring the insights underlying this type of infrared AR problem and for designing techniques that improve performance on the InfAR dataset.
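As a rough illustration of two of the techniques compared above, the following minimal Python sketch shows a VLAD encoding of local descriptors and the difference between early fusion (combining feature encodings before classification) and late fusion (combining per-feature classifier scores). It is not the authors' implementation: the function names, the hard nearest-codeword assignment, the plain L2 normalization, and the score averaging are all assumptions made for illustration, and the abstract does not specify these details.

```python
import numpy as np


def vlad_encode(descriptors, centers):
    """Encode local descriptors (n, d) against a codebook (k, d) as a VLAD vector (k*d,).

    Assumption: hard assignment to the nearest codeword and a final L2
    normalization; the paper's exact settings are not given in the abstract.
    """
    # Squared Euclidean distance from every descriptor to every codeword.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)

    k, d = centers.shape
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            # Accumulate residuals between descriptors and their codeword.
            vlad[i] = (members - centers[i]).sum(axis=0)

    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


def early_fusion(encodings):
    """Early fusion: concatenate per-feature encodings into one vector
    that a single classifier would then consume."""
    return np.concatenate(encodings)


def late_fusion(scores_per_feature, weights=None):
    """Late fusion: combine per-feature classifier scores, here by a
    (hypothetical) weighted average over the feature axis."""
    scores = np.asarray(scores_per_feature, dtype=float)
    return np.average(scores, axis=0, weights=weights)
```

For example, `vlad_encode(desc, centers)` with 50 four-dimensional descriptors and a 3-word codebook returns a unit-norm vector of length 12, and `late_fusion([[0.2, 0.8], [0.6, 0.4]])` averages two classifiers' scores over two classes.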
