Paper information - Keep it simple and sparse: real-time action recognition

Sparsity has been shown to be one of the most important properties for visual recognition. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start from RGBD images, combine motion and appearance cues, and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from the data. We then propose a method for simultaneous on-line video segmentation and action recognition using linear SVMs. The main contribution of the paper is an effective real-time system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques for representing 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired with a Kinect sensor that includes complex actions and gestures differing in small details, and a data set created for human-robot interaction purposes. Finally, we demonstrate that our system is also effective in a human-robot interaction setting and propose a memory game, "All Gestures You Can", to be played against a humanoid robot.
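
To make the sparse coding step mentioned in the abstract more concrete, the following is a minimal sketch of encoding one feature vector over a fixed dictionary with an L1 penalty. It is not the paper's implementation: the solver (ISTA, iterative soft-thresholding), the function names `soft_threshold` and `sparse_code`, the penalty `lam`, and the toy dictionary are all assumptions chosen for illustration. The abstract describes the coding as adaptive, which presumably means the dictionary is also learned from data; that part is left out here.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def sparse_code(x, atoms, lam=0.1, n_iter=200):
    """Illustrative ISTA solver (an assumption, not the paper's algorithm).

    Approximately minimises 0.5 * ||x - sum_j u[j] * atoms[j]||^2 + lam * ||u||_1.
    x     : feature vector, e.g. a concatenated frame descriptor
    atoms : dictionary given as a list of K atoms, each as long as x
    """
    n, k = len(x), len(atoms)
    # The squared Frobenius norm of the dictionary upper-bounds the Lipschitz
    # constant of the gradient, so 1 / that value is a safe step size.
    lipschitz = sum(a_i * a_i for atom in atoms for a_i in atom)
    step = 1.0 / lipschitz
    u = [0.0] * k
    for _ in range(n_iter):
        # Reconstruction with the current code and its residual.
        recon = [sum(u[j] * atoms[j][i] for j in range(k)) for i in range(n)]
        resid = [recon[i] - x[i] for i in range(n)]
        # Gradient of the quadratic term with respect to each coefficient.
        grad = [sum(atoms[j][i] * resid[i] for i in range(n)) for j in range(k)]
        # Gradient step followed by soft-thresholding, which zeroes small coefficients.
        u = [soft_threshold(u[j] - step * grad[j], step * lam) for j in range(k)]
    return u


if __name__ == "__main__":
    # Toy overcomplete dictionary: three axis-aligned atoms plus one oblique atom.
    atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]
    x = [0.6, 0.8, 0.05]
    print([round(c, 3) for c in sparse_code(x, atoms, lam=0.1)])
```

In a pipeline like the one the abstract outlines, codes of this kind would be computed per frame from the motion and appearance descriptors and then passed to a linear classifier; how the paper pools or concatenates them is not stated in the abstract, so that step is not sketched here.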
