3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding

We describe a new approach to transferring knowledge across views for action recognition, using examples from a large collection of unlabelled mocap data. We achieve this by directly matching purely motion-based features from videos to mocap. Our approach recovers 3D pose sequences without performing any body part tracking. We use these matches to generate multiple motion projections, which adds view invariance to our action recognition model. We also introduce a closed-form solution for approximate non-linear Circulant Temporal Encoding (nCTE), which allows us to perform the matching efficiently in the frequency domain. We test our approach on the challenging unsupervised modality of the IXMAS dataset, using publicly available motion capture data for matching. Without any additional annotation effort, we significantly outperform the current state of the art.
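To illustrate why matching in the frequency domain is efficient, the following is a minimal sketch of plain linear circulant temporal matching, not the approximate non-linear variant (nCTE) introduced here. It scores every circular temporal shift between two descriptor sequences with a single FFT-based cross-correlation. The function name `circulant_match_scores`, the array shapes, and the random toy data are assumptions made for illustration only.

```python
import numpy as np


def circulant_match_scores(query, database):
    """Score every circular temporal shift of `database` against `query`.

    Both inputs are real arrays of shape (T, D): T frames, D feature dims.
    Returns a length-T array where
        scores[s] = sum_t <query[t], database[(t + s) % T]>.
    """
    q_hat = np.fft.fft(query, axis=0)
    d_hat = np.fft.fft(database, axis=0)
    # Cross-correlation theorem: correlation in time is an element-wise
    # product (with one factor conjugated) in frequency.
    per_dim = np.fft.ifft(np.conj(q_hat) * d_hat, axis=0).real
    # Sum the per-dimension correlations to obtain the inner-product score.
    return per_dim.sum(axis=1)


# Toy usage: the query is the database sequence advanced by 5 frames.
rng = np.random.default_rng(0)
database = rng.standard_normal((64, 8))
query = np.roll(database, -5, axis=0)  # query[t] == database[(t + 5) % 64]
scores = circulant_match_scores(query, database)
best_shift = int(np.argmax(scores))
```

All T shifts are scored in O(T log T) time per feature dimension, instead of the O(T^2) cost of comparing the sequences at every offset explicitly.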
