uulmMAD - A Human Action Recognition Dataset for Ground-Truth Evaluation and Investigation of View Invariances

In recent years, human action recognition has gained increasing attention in pattern recognition. However, many datasets in the literature focus on a limited number of target-oriented properties. In this work, we present a novel dataset, named uulmMAD, which has been created to benchmark state-of-the-art action recognition architectures with respect to multiple properties, e.g., high camera resolution, perspective changes, realistic cluttered backgrounds and noise, overlapping action classes, different execution speeds, variability in subjects and their clothing, and the availability of a pose ground truth. The uulmMAD was recorded using three synchronized high-resolution cameras and an inertial motion-capture system. Each subject performed fourteen actions at least three times in front of a green screen. Selected actions were recorded in four variants, i.e., normal, pausing, fast, and decelerating. The data were post-processed to separate the subject from the background. Furthermore, the camera data and the motion-capture data have been mapped onto each other, and 3D avatars have been generated to further extend the dataset. The avatars have also been used to emulate the self-occlusion that arises in pose recognition with a time-of-flight camera. In this work, we analyze the uulmMAD using a state-of-the-art action recognition architecture to provide first baseline results. The results emphasize the unique characteristics of the dataset. The dataset will be made publicly available upon publication of the paper.
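The green-screen post-processing mentioned above amounts to standard chroma keying. The abstract does not specify the matting method used, so the following is only a minimal illustrative sketch of the idea in Python/OpenCV: backdrop pixels are thresholded in HSV space and the subject is extracted via the inverted mask. The hue bounds and file names are hypothetical.

```python
import cv2
import numpy as np

def separate_subject(frame_bgr, lower=(35, 60, 60), upper=(85, 255, 255)):
    """Chroma-key a green-screen frame: return the subject with the
    backdrop zeroed out, plus the binary foreground mask.
    The HSV bounds are illustrative, not taken from the paper."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Pixels whose hue falls inside the green range belong to the backdrop.
    bg_mask = cv2.inRange(hsv, np.array(lower, np.uint8),
                          np.array(upper, np.uint8))
    fg_mask = cv2.bitwise_not(bg_mask)
    # Morphological opening suppresses speckle noise in the mask.
    kernel = np.ones((5, 5), np.uint8)
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
    subject = cv2.bitwise_and(frame_bgr, frame_bgr, mask=fg_mask)
    return subject, fg_mask

frame = cv2.imread("frame_0001.png")  # hypothetical input frame
subject, mask = separate_subject(frame)
cv2.imwrite("subject_0001.png", subject)
```

In practice, a dataset pipeline would refine this with soft alpha matting near the silhouette boundary; a hard binary mask is only the simplest possible variant.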
