Enhancing Mobile Voice Assistants with WorldGaze

Contemporary voice assistants require that objects of interest be specified explicitly in spoken commands. Of course, users are often looking directly at the object or place of interest: fine-grained, contextual information that is currently unused. We present WorldGaze, a software-only method for smartphones that provides the real-world gaze location of a user, which voice agents can use for rapid, natural, and precise interactions. We achieve this by simultaneously opening the front and rear cameras of a smartphone. The front-facing camera tracks the head in 3D, including estimating its direction vector. Because the geometry of the front and rear cameras is fixed and known, we can raycast the head vector into the 3D world scene captured by the rear-facing camera. This allows users to intuitively define an object or region of interest using their head gaze. We began our investigations with a qualitative exploration of competing methods before developing a functional, real-time implementation. We conclude with an evaluation showing that WorldGaze can be quick and accurate, opening new multimodal gaze+voice interactions for mobile voice agents.
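
To make the geometric step concrete, below is a minimal Python/NumPy sketch (not the authors' implementation) of the raycasting idea: the head-gaze ray estimated in the front-camera frame is mapped into the rear-camera frame through the fixed inter-camera transform, then marched forward and projected into the rear image until it meets observed scene depth. The names (`T_rear_from_front`, `K_rear`, `depth_lookup`) and the z-forward camera convention are assumptions for exposition.

```python
import numpy as np

def head_ray_to_rear_frame(head_pos_f, head_dir_f, T_rear_from_front):
    """Map the head-gaze ray from the front-camera frame into the
    rear-camera frame using the fixed 4x4 rigid transform between them."""
    R = T_rear_from_front[:3, :3]            # rotation between camera frames
    t = T_rear_from_front[:3, 3]             # translation between camera frames
    origin_r = R @ head_pos_f + t            # ray origin in rear-camera frame
    dir_r = R @ head_dir_f                   # ray direction in rear-camera frame
    return origin_r, dir_r / np.linalg.norm(dir_r)

def project_to_rear_image(point_r, K_rear):
    """Project a 3D point (rear-camera frame, z forward) to pixel coords."""
    x, y, z = point_r
    return np.array([K_rear[0, 0] * x / z + K_rear[0, 2],
                     K_rear[1, 1] * y / z + K_rear[1, 2]])

def raycast_gaze(head_pos_f, head_dir_f, T_rear_from_front, K_rear,
                 depth_lookup, max_depth=5.0, step=0.05):
    """March along the transformed gaze ray; the first sample whose depth
    matches the observed scene depth is taken as the world gaze point.
    `depth_lookup(u, v)` is a hypothetical callable returning scene depth
    at a rear-image pixel (e.g., from a depth map or SLAM reconstruction)."""
    o, d = head_ray_to_rear_frame(head_pos_f, head_dir_f, T_rear_from_front)
    for s in np.arange(step, max_depth, step):
        p = o + s * d
        if p[2] <= 0:                        # sample is behind the rear camera
            continue
        u, v = project_to_rear_image(p, K_rear)
        scene_depth = depth_lookup(u, v)
        if scene_depth is not None and abs(scene_depth - p[2]) < step:
            return (u, v), p                 # pixel hit and 3D gaze point
    return None, None
```

The returned pixel location can then be handed to an object detector or segmenter operating on the rear-camera frame to resolve which object the user is gazing at.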
