The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities

This paper describes a framework for modeling human activities as temporally structured processes. Our approach is motivated by the inherently hierarchical nature of human activities and the close correspondence between human actions and speech: we model action units using Hidden Markov Models, much like words in speech. These action units then serve as the building blocks for modeling complex human activities as sentences using an action grammar. To evaluate our approach, we collected a large dataset of daily cooking activities: the dataset includes 52 participants, each performing a total of 10 cooking activities in multiple real-life kitchens, resulting in over 77 hours of video footage. We evaluate the HTK toolkit, a state-of-the-art speech recognition engine, in combination with multiple video feature descriptors, for both the recognition of cooking activities (e.g., making pancakes) and the semantic parsing of videos into action units (e.g., cracking eggs). Our results demonstrate the benefits of structured temporal generative approaches over existing discriminative approaches in coping with the complexity of human daily-life activities.
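As a rough illustration of the words-and-sentences analogy (not the HTK-based pipeline actually evaluated in the paper), the Python sketch below trains one Gaussian HMM per action unit on per-frame video descriptors and labels a pre-segmented clip by maximum log-likelihood; the hmmlearn usage, feature shapes, and unit names are assumptions made purely for illustration.

```python
# Minimal sketch of the "action units as words" idea: one HMM per action
# unit, scored over frame-level video features. Simplified illustration
# using hmmlearn; the paper itself evaluates the HTK toolkit.
import numpy as np
from hmmlearn import hmm  # assumption: hmmlearn is installed

def train_unit_hmms(unit_sequences, n_states=5):
    """unit_sequences: dict mapping an action-unit label (e.g., 'crack_egg',
    a hypothetical name) to a list of (n_frames, n_features) arrays of
    per-frame feature descriptors."""
    models = {}
    for unit, seqs in unit_sequences.items():
        X = np.vstack(seqs)               # concatenate all training frames
        lengths = [len(s) for s in seqs]  # frame count of each sequence
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=50)
        m.fit(X, lengths)                 # Baum-Welch training per unit
        models[unit] = m
    return models

def recognize_unit(models, segment):
    """Label a pre-segmented clip (n_frames, n_features) with the action-unit
    HMM that assigns it the highest log-likelihood."""
    return max(models, key=lambda unit: models[unit].score(segment))
```

In the full approach described above, an action grammar over such unit models constrains the decoding of an entire activity video into an ordered sequence of action units, analogous to sentence-level decoding over word models in speech recognition.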
