Instructing people for training gestural interactive systems

Entertainment and gaming systems such as the Wii and Xbox Kinect have brought touchless, body-movement-based interfaces to the masses. Systems like these estimate the movements of various body parts from raw inertial-motion or depth-sensor data. However, the interface developer is still left with the challenging task of creating a system that recognizes these movements as embodying meaning. The machine learning approach to this problem requires collecting data sets that contain the relevant body movements and their associated semantic labels. These data sets directly impact the accuracy and performance of the gesture recognition system and should ideally contain all natural variations of the movements associated with a gesture. This paper addresses the problem of collecting such gesture data sets. In particular, we investigate which semiotic modality of instruction is most appropriate for conveying to human subjects the movements the system developer needs them to perform. The results of our qualitative and quantitative analysis indicate that the choice of modality has a significant impact on the performance of the learnt gesture recognition system, particularly in terms of correctness and coverage.
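
To make the training step described above concrete, the following minimal sketch (our illustration, not the implementation used in the paper) fits a standard random-forest classifier to a labelled gesture data set. Each example is a fixed-length window of pose features carrying one semantic gesture label; the data shapes, feature layout, and label set are placeholder assumptions.

    # Minimal sketch: train a gesture recognizer on labelled pose features.
    # The arrays below are random placeholders standing in for real recordings.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_windows, n_features = 500, 60               # e.g. 20 joints x 3 coordinates
    X = rng.normal(size=(n_windows, n_features))  # hypothetical pose-feature windows
    y = rng.integers(0, 4, size=n_windows)        # 4 hypothetical gesture labels

    # Hold out part of the data to estimate recognition accuracy.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

In practice the held-out split would be by subject rather than by window, so that the reported accuracy reflects how well the recognizer covers natural variation across people rather than memorizing individual performers.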
