Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition

An important aspect of designing interactive, action-based interfaces is recognizing actions reliably and with minimal latency. High latency causes the system’s feedback to lag behind the user’s actions, which significantly degrades the interactivity of the experience. This paper presents algorithms for reducing latency in action recognition. We use a latency-aware learning formulation to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses them to recognize actions robustly in the presence of ambiguous poses. We introduce a new, publicly released dataset for our experiments. Comparisons of our method against both a Bag of Words classifier and a Conditional Random Field (CRF) classifier show improved recognition performance on both pre-segmented and online classification tasks. Additionally, we employ GentleBoost to reduce our feature set and further improve our results. We then present experiments that explore the accuracy/latency trade-off over a varying number of actions. Finally, we evaluate our algorithm on two existing datasets.
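The accuracy/latency trade-off described above can be illustrated with a minimal sketch. It is not the paper's latency-aware formulation: it assumes a hypothetical online classifier that emits one logistic score per action per frame and fires as soon as the top probability reaches a threshold. The function name `first_detection`, the action labels, and the score values are all invented for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def first_detection(frame_scores, threshold):
    """Return (label, frames_observed) for the first frame whose most
    probable action reaches `threshold`, or (None, total_frames) if no
    frame does. `frames_observed` stands in for observational latency."""
    for t, scores in enumerate(frame_scores):
        probs = {label: sigmoid(s) for label, s in scores.items()}
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:
            return label, t + 1
    return None, len(frame_scores)


# Hypothetical per-frame scores for one sequence whose true action is "punch".
# The first frame is an ambiguous pose that slightly favors "wave".
sequence = [
    {"punch": -0.5, "wave": 0.4},
    {"punch": 0.6, "wave": 0.3},
    {"punch": 1.5, "wave": -0.2},
    {"punch": 3.0, "wave": -1.0},
]

for threshold in (0.55, 0.75, 0.9):
    print(threshold, first_detection(sequence, threshold))
```

In this toy sequence, the lowest threshold fires on the first, ambiguous frame and picks the wrong action, while higher thresholds wait for a more distinctive pose and answer correctly at the cost of observing more frames.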
