Combining body pose, gaze, and gesture to determine intention to interact in vision-based interfaces

Vision-based interfaces, such as those made popular by the Microsoft Kinect, suffer from the Midas Touch problem: every user motion can be interpreted as an interaction. In response, we developed an algorithm that combines facial features, body pose, and motion to approximate a user's intention to interact with the system. We show how this can be used to determine when to pay attention to a user's actions and when to ignore them. To demonstrate the value of our approach, we present results from a 30-person lab study comparing four engagement algorithms in single- and multi-user scenarios. We found that combining intention to interact with a 'raise an open hand in front of you' gesture yielded the best results: this combined approach offers a 12% improvement in accuracy and a 20% reduction in time to engage over a baseline 'wave to engage' gesture currently used on the Xbox 360.
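A minimal sketch of the idea described above, assuming hypothetical per-frame features (face orientation, torso pose, motion energy), hand-tuned weights, and a fixed threshold: an intention-to-interact score is computed from the features and engagement is triggered only when that score is high and an open hand is raised. This is an illustration under those assumptions, not the paper's actual model or implementation.

```python
# Sketch only: feature names, weights, and threshold below are illustrative
# assumptions, not the algorithm evaluated in the paper.

from dataclasses import dataclass


@dataclass
class FrameFeatures:
    face_toward_sensor: float   # 0..1, e.g. from head-pose/gaze estimation
    torso_facing_sensor: float  # 0..1, from skeletal body pose
    body_motion: float          # 0..1, normalized motion energy
    hand_raised: bool           # open hand held up in front of the body


def intention_score(f: FrameFeatures) -> float:
    """Weighted combination of features approximating intention to interact.
    The weights here are placeholders chosen for illustration."""
    w_face, w_pose, w_motion = 0.5, 0.3, 0.2
    return (w_face * f.face_toward_sensor
            + w_pose * f.torso_facing_sensor
            + w_motion * f.body_motion)


def should_engage(f: FrameFeatures, threshold: float = 0.6) -> bool:
    """Engage only when intention is high AND the explicit gesture is present,
    mirroring the 'intention to interact + raise an open hand' condition."""
    return f.hand_raised and intention_score(f) >= threshold


if __name__ == "__main__":
    frame = FrameFeatures(face_toward_sensor=0.9, torso_facing_sensor=0.8,
                          body_motion=0.3, hand_raised=True)
    print(should_engage(frame))  # True: user faces the sensor with hand raised
```

In practice the mapping from features to an engagement decision would be learned from labeled data rather than hand-weighted, but the gating structure (continuous intention score plus an explicit gesture) is the point being illustrated.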
