Using a Product Manifold distance for unsupervised action recognition

This paper presents a method for unsupervised learning and recognition of human actions in video. Lacking any supervision, nothing except the inherent biases of a given representation guides the grouping of video clips into semantically meaningful partitions. In the first part of this paper, we therefore compare two contemporary methods, Bag of Features (BOF) and Product Manifolds (PM), for clustering video clips of human facial expressions, hand gestures, and full-body actions, with the goal of better understanding how well these very different approaches to behavior recognition produce semantically relevant clusterings of the data. We show that PM yields superior results when measuring the alignment between the generated clusters and the nominal class labels of the data set. While gross motions were easily clustered by both methods, the BOF representation's failure to preserve structural information leads to limitations that are not easily overcome without supervised training, as evidenced by BOF's poor separation of shape labels in the hand-gesture data and its overall poor performance on full-body actions. In the second part of this paper, we present an unsupervised mechanism for learning micro-actions in continuous video streams using the PM representation. Unlike other works, our method requires no prior knowledge of the expected number of labels/classes, requires no silhouette extraction, is tolerant of minor tracking errors and jitter, and can operate at near real-time speed. We show how to construct a set of training "tracklets," how to cluster them using the Product Manifold distance measure, and how to perform detection using exemplars learned from the clusters. Further, we show that the system is amenable to incremental learning as anomalous activities are detected in the video stream. We demonstrate performance on the publicly available ETHZ Livingroom data set.
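The Product Manifold distance referenced above can be sketched as follows, under the usual HOSVD-style construction: a video clip is treated as a third-order tensor, each mode unfolding yields an orthonormal basis (a point on a Grassmann manifold), and per-mode subspace distances are combined across the product of those manifolds. This is a minimal illustrative sketch, not the paper's implementation; the subspace dimension `p`, the function names, and the choice to sum per-mode chordal distances (rather than, e.g., take their product) are assumptions made here for concreteness.

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-n unfolding: arrange the mode-n fibers of the tensor
    # as columns of a matrix.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def grassmann_point(tensor, mode, p):
    # Top-p left singular vectors of the mode-n unfolding give an
    # orthonormal basis -- a point on a Grassmann manifold.
    u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
    return u[:, :p]

def chordal_distance(A, B):
    # Chordal distance from the canonical (principal) angles between
    # the subspaces spanned by A and B.
    s = np.clip(np.linalg.svd(A.T @ B, compute_uv=False), -1.0, 1.0)
    theta = np.arccos(s)
    return np.sqrt(np.sum(np.sin(theta) ** 2))

def product_manifold_distance(X, Y, p=3):
    # Combine per-mode subspace distances over the product of Grassmann
    # manifolds (summed here; other aggregations are possible).
    return sum(
        chordal_distance(grassmann_point(X, m, p), grassmann_point(Y, m, p))
        for m in range(X.ndim)
    )
```

In this sketch, clustering tracklets would then amount to feeding the pairwise `product_manifold_distance` matrix to any distance-based clustering algorithm, and detection would compare an incoming tracklet's distance against the learned cluster exemplars.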
