Keep it simple and sparse: real-time action recognition

Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective real-time system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, "All Gestures You Can", to be played against a humanoid robot.

[1]  H. Hirschfeld A Connection between Correlation and Contingency , 1935, Mathematical Proceedings of the Cambridge Philosophical Society.

[2]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[3]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[4]  N. Otsu A threshold selection method from gray level histograms , 1979 .

[5]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[6]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[7]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[8]  P. Rochat Early Social Cognition : Understanding Others in the First Months of Life , 1999 .

[9]  Cesare Comoldi,et al.  Strategic memory deficits in attention deficit disorder with hyperactivity participants: The role of executive processes , 1999 .

[10]  W. Eric L. Grimson,et al.  Adaptive background mixture models for real-time tracking , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[11]  Jake K. Aggarwal,et al.  Segmentation and recognition of continuous human activity , 2001, Proceedings IEEE Workshop on Detection and Recognition of Events in Video.

[12]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[14]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[15]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[16]  T. Poggio,et al.  Cognitive neuroscience: Neural mechanisms for the recognition of biological movements , 2003, Nature Reviews Neuroscience.

[17]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[18]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[19]  Ling Guan,et al.  Continuous human activity recognition , 2004, ICARCV 2004 8th Control, Automation, Robotics and Vision Conference, 2004..

[20]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[21]  Thomas Brox,et al.  Universität Des Saarlandes Fachrichtung 6.1 – Mathematik Highly Accurate Optic Flow Computation with Theoretically Justified Warping Highly Accurate Optic Flow Computation with Theoretically Justified Warping , 2022 .

[22]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[23]  M. J. Burden,et al.  Implicit Memory Development in School-Aged Children With Attention Deficit Hyperactivity Disorder (ADHD): Conceptual Priming Deficit? , 2005, Developmental neuropsychology.

[24]  Sheng-Wen Shih,et al.  Continuous Human Action Segmentation and Recognition Using a Spatio-Temporal Probabilistic Framework , 2006, Eighth IEEE International Symposium on Multimedia (ISM'06).

[25]  Paul Lukowicz,et al.  Performance Metrics and Evaluation Issues for Continuous Activity Recognition , 2006 .

[26]  Giorgio Metta,et al.  YARP: Yet Another Robot Platform , 2006 .

[27]  Michael Elad,et al.  Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries , 2006, IEEE Transactions on Image Processing.

[28]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[29]  Pia Borlund,et al.  Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results , 2007, J. Assoc. Inf. Sci. Technol..

[30]  Ramakant Nevatia,et al.  Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Ramakant Nevatia,et al.  Coupled Hidden Semi Markov Models for Activity Recognition , 2007, 2007 IEEE Workshop on Motion and Video Computing (WMVC'07).

[32]  Jesper W. Schneider,et al.  Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results , 2007 .

[33]  Manik Varma,et al.  Learning The Discriminative Power-Invariance Trade-Off , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[34]  Frederic Devernay,et al.  A Variational Method for Scene Flow Estimation from Stereo Sequences , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[35]  H. Hirschmüller Stereo Processing by Semiglobal Matching and Mutual Information , 2008, IEEE Trans. Pattern Anal. Mach. Intell..

[36]  Guillermo Sapiro,et al.  Discriminative learned dictionaries for local image analysis , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[39]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[40]  Giulio Sandini,et al.  The iCub humanoid robot: an open platform for research in embodied cognition , 2008, PerMIS.

[41]  Michael Elad,et al.  Sparse Representation for Color Image Restoration , 2008, IEEE Transactions on Image Processing.

[42]  S. Gong,et al.  Recognising action as clouds of space-time interest points , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Stan Sclaroff,et al.  A Unified Framework for Gesture Recognition and Spatiotemporal Gesture Segmentation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Francesca Odone,et al.  A Sparsity-Enforcing Method for Learning Face Features , 2009, IEEE Transactions on Image Processing.

[46]  Daniel Cremers,et al.  Stereoscopic Scene Flow Computation for 3D Motion Understanding , 2011, International Journal of Computer Vision.

[47]  L. Fadiga,et al.  Automatic versus Voluntary Motor Imitation: Effect of Visual Context and Stimulus Velocity , 2010, PloS one.

[48]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[49]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[50]  Ilaria Gori,et al.  Arm-Hand Behaviours Modelling: From Attention to Imitation , 2010, ISVC.

[51]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[52]  Radu Horaud,et al.  Scene flow estimation by growing correspondence seeds , 2011, CVPR 2011.

[53]  Peyman Milanfar,et al.  Action Recognition from One Example , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[55]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[56]  Bingbing Ni,et al.  Geometric ℓp-norm feature pooling for image classification , 2011, CVPR 2011.

[57]  Venu Govindaraju,et al.  A temporal Bayesian model for classifying, detecting and localizing activities in video sequences , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[58]  Yui Man Lui,et al.  A least squares regression framework on manifolds and its application to gesture recognition , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[59]  Isabelle Guyon,et al.  ChaLearn gesture challenge: Design and first results , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[60]  Ying Wu,et al.  Robust 3D Action Recognition with Random Occupancy Patterns , 2012, ECCV.

[61]  Giorgio Metta,et al.  All gestures you can: A memory game against a humanoid robot , 2012, 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012).

[62]  Ling Shao,et al.  One shot learning gesture recognition from RGBD images , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[63]  Sotirios Chatzis,et al.  A conditional random field-based model for joint sequence segmentation and classification , 2013, Pattern Recognit..

[64]  Hafiz Imtiaz,et al.  A template matching approach of one-shot-learning gesture recognition , 2013, Pattern Recognit. Lett..