Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking

Action recognition has received increasing attention from the computer vision and machine learning communities in the last decade. To enable the study of this problem, there exist a vast number of action datasets, which are recorded under controlled laboratory settings, real-world surveillance environments, or crawled from the Internet. Apart from the "in-the-wild" datasets, the training and test split of conventional datasets often possess similar environments conditions, which leads to close to perfect performance on constrained datasets. In this paper, we introduce a new dataset, namely Multi-Camera Action Dataset (MCAD), which is designed to evaluate the open view classification problem under the surveillance environment. In total, MCAD contains 14,298 action samples from 18 action categories, which are performed by 20 subjects and independently recorded with 5 cameras. Inspired by the well received evaluation approach on the LFW dataset, we designed a standard evaluation protocol and benchmarked MCAD under several scenarios. The benchmark shows that while an average of 85% accuracy is achieved under the closed-view scenario, the performance suffers from a significant drop under the cross-view scenario. In the worst case scenario, the performance of 10-fold cross validation drops from 87.0% to 47.4%.

[1]  Xuelong Li,et al.  Flowing on Riemannian Manifold: Domain Adaptation by Shifting Covariance , 2014, IEEE Transactions on Cybernetics.

[2]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[3]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[4]  Joel A. Tropp,et al.  Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit , 2007, IEEE Transactions on Information Theory.

[5]  Limin Wang,et al.  Mining Motion Atoms and Phrases for Complex Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[7]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[8]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[9]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Michael I. Jordan,et al.  Multiple kernel learning, conic duality, and the SMO algorithm , 2004, ICML.

[11]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[12]  Rama Chellappa,et al.  Cross-View Action Recognition via a Transferable Dictionary Pair , 2012, BMVC.

[13]  Liang Wang,et al.  Learning and Matching of Dynamic Shape Manifolds for Human Action Recognition , 2007, IEEE Transactions on Image Processing.

[14]  Christopher Joseph Pal,et al.  Activity recognition using the velocity histories of tracked keypoints , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  Weizhi Nie,et al.  Human Action Recognition Bases on Local Action Attributes , 2015 .

[16]  Christopher G. Harris,et al.  A Combined Corner and Edge Detector , 1988, Alvey Vision Conference.

[17]  Philip H. S. Torr,et al.  Learning Discriminative Space–Time Action Parts from Weakly Labelled Videos , 2013, International Journal of Computer Vision.

[18]  Yuting Su,et al.  Multiple/Single-View Human Action Recognition via Part-Induced Multitask Structural Learning , 2015, IEEE Transactions on Cybernetics.

[19]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[20]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[21]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Terrance E. Boult,et al.  Towards Open World Recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[24]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[25]  Ivor W. Tsang,et al.  Visual Event Recognition in Videos by Learning from Web Data , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[27]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[28]  Conrad Sanderson,et al.  Log-Euclidean bag of words for human action recognition , 2014, IET Comput. Vis..

[29]  Nitish Srivastava,et al.  Exploiting Image-trained CNN Architectures for Unconstrained Video Classification , 2015, BMVC.

[30]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  A. Bruckstein,et al.  K-SVD : An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation , 2005 .

[32]  Tal Hassner,et al.  The Action Similarity Labeling Challenge , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Weizhi Nie,et al.  Evaluation of local spatial-temporal features for cross-view action recognition , 2016, Neurocomputing.

[34]  Hal Daumé,et al.  Frustratingly Easy Domain Adaptation , 2007, ACL.

[35]  Pengjiang Qian,et al.  Collaborative Fuzzy Clustering From Multiple Weighted Views , 2015, IEEE Transactions on Cybernetics.

[36]  Hossein Ragheb,et al.  MuHAVi: A Multicamera Human Action Video Dataset for the Evaluation of Action Recognition Methods , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[37]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[38]  Larry S. Davis,et al.  Representing Videos Using Mid-level Discriminative Patches , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Marwan Mattar,et al.  Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments , 2008 .

[40]  Ali Farhadi,et al.  Learning to Recognize Activities from the Wrong View Point , 2008, ECCV.

[41]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[42]  Wei Liang,et al.  Human action recognition using discriminative models in the learned hierarchical manifold space , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[43]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[45]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Ivor W. Tsang,et al.  Domain Transfer Multiple Kernel Learning , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  M. Elad,et al.  $rm K$-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation , 2006, IEEE Transactions on Signal Processing.

[49]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[50]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[51]  J. Franklin,et al.  The elements of statistical learning: data mining, inference and prediction , 2005 .

[52]  Douglas A. Reynolds Gaussian Mixture Models , 2009, Encyclopedia of Biometrics.

[53]  Gang Wang,et al.  Solution Path for Manifold Regularized Semisupervised Classification , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).