An Unsupervised Framework for Action Recognition Using Actemes

In speech recognition, phonemes have demonstrated their efficacy to model the words of a language. While they are well defined for languages, their extension to human actions is not straightforward. In this paper, we study such an extension and propose an unsupervised framework to find phoneme-like units for actions, which we call actemes, using 3D data and without any prior assumptions. To this purpose, build on an earlier proposed framework in speech literature to automatically find actemes in the training data. We experimentally show that actions defined in terms of actemes and actions defined by whole units give similar recognition results. We define actions out of the training set in terms of these actemes to see whether the actemes generalize to unseen actions. The results show that although the acteme definitions of the actions are not always semantically meaningful, they yield optimal recognition accuracy and constitute a promising direction of research for action modeling.

[1]  Rama Chellappa,et al.  From Videos to Verbs : Mining Videos for Events using a Cascade of Dynamical Systems , 2007 .

[2]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  V. Ramasubramanian,et al.  Acoustic modeling by phoneme templates and modified one-pass DP decoding for continuous speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Rémi Ronfard,et al.  Automatic Discovery of Action Taxonomies from Multiple Views , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[5]  Thippur V. Sreenivas,et al.  Automatically derived units for segment vocoders , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  V. Ramasubramanian,et al.  A Framework for Indexing Human Actions in Video , 2008 .

[7]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[8]  Yiannis Aloimonos,et al.  View-Invariant Modeling and Recognition of Human Actions Using Grammars , 2006, WDV.

[9]  Ling Guan,et al.  Quantifying and recognizing human movement patterns from monocular video Images-part I: a new framework for modeling human motion , 2004, IEEE Transactions on Circuits and Systems for Video Technology.

[10]  Yiannis Aloimonos,et al.  A Language for Human Action , 2007, Computer.

[11]  Junji Yamato,et al.  Recognizing human action in time-sequential images using hidden Markov model , 1992, Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Ashok Veeraraghavan,et al.  The Function Space of an Activity , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[13]  Rama Chellappa,et al.  From Videos to Verbs: Mining Videos for Activities using a Cascade of Dynamical Systems , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  J. Sullivan,et al.  Action Recognition by Shape Matching to Key Frames , 2002 .

[15]  Christoph Bregler,et al.  Learning and recognizing human dynamics in video sequences , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Fritz Class,et al.  A learning procedure for speaker-dependent word recognition systems based on sequential processing of input tokens , 1983, ICASSP.

[17]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Frank K. Soong,et al.  A segment model based approach to speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[19]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[20]  Rama Chellappa,et al.  Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Hermann Ney,et al.  The use of a one-stage dynamic programming algorithm for connected word recognition , 1984 .

[22]  Rama Chellappa,et al.  Locally time-invariant models of human activities using trajectories on the grassmannian , 2009, CVPR.

[23]  Torbjørn Svendsen,et al.  On the automatic segmentation of speech signals , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[24]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[25]  Edmond Boyer,et al.  Action recognition using exemplar-based embedding , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..