Learning 3D action models from a few 2D videos for view invariant action recognition

Most existing approaches for learning action models work by extracting suitable low-level features and then training appropriate classifiers. Such approaches require large amounts of training data and do not generalize well to variations in viewpoint and scale, or across datasets. Some recent work learns multi-view action models from motion capture (Mocap) data, but obtaining such data is time-consuming and requires costly infrastructure. We present a method that addresses both these issues by learning action models from just a few video training samples. We model each action as a sequence of primitive actions, represented as functions that transform the actor's state. We formulate model learning as a curve-fitting problem, and present a novel algorithm for learning human actions by lifting 2D annotations of a few keyposes to 3D and interpolating between them. Actions are inferred by sampling the models and accumulating the feature weights learned discriminatively using a latent-state Perceptron algorithm. We show results comparable to the state of the art on the standard Weizmann dataset, with a much smaller train:test ratio, and also on datasets for visual gesture recognition and cluttered grocery store environments.
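
The idea of lifting 2D keypose annotations to 3D and interpolating between them can be illustrated with a minimal sketch. It is not the algorithm from the paper. It assumes a scaled-orthographic camera, a known limb length, a pose stored as a flat list of joint values, and plain linear interpolation between keyposes. The function names `lift_limb_depth` and `interpolate_keyposes` and all parameters are hypothetical.

```python
import math


def lift_limb_depth(p_2d, q_2d, limb_length, scale=1.0):
    """Magnitude of the relative depth between two joints of one limb.

    Assumption: scaled-orthographic projection, so the image-plane
    displacement divided by `scale` plus the unknown depth difference
    must reproduce the known 3D limb length. The sign of the depth
    (towards or away from the camera) is not recoverable from this
    relation alone and has to be chosen separately.
    """
    du = (q_2d[0] - p_2d[0]) / scale
    dv = (q_2d[1] - p_2d[1]) / scale
    squared = limb_length ** 2 - (du ** 2 + dv ** 2)
    # Clamp at zero: annotation noise can make the limb look too long.
    return math.sqrt(max(squared, 0.0))


def interpolate_keyposes(pose_a, pose_b, num_frames):
    """Linearly interpolate between two keyposes.

    Assumption: each pose is a flat list of joint values of equal length.
    Returns `num_frames` poses, including both endpoints.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        frames.append([(1 - t) * a + t * b for a, b in zip(pose_a, pose_b)])
    return frames


# A limb of length 5 whose image projection has length 3 (scale 1).
depth = lift_limb_depth((0.0, 0.0), (3.0, 0.0), limb_length=5.0)

# Three frames spanning two keyposes with two joint values each.
frames = interpolate_keyposes([0.0, 10.0], [10.0, 20.0], num_frames=3)
```

Linear interpolation is only the simplest choice here. A primitive action modeled as a function transforming the actor's state could equally be fitted with a smoother curve, and the depth sign ambiguity would have to be resolved with additional constraints.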
