Toward Unconstrained Gesture and Language Interfaces

In this work, we investigate how people refer to objects in the world during relatively unstructured communication. We collect a corpus of object descriptions from non-expert users who employ language and naturalistic gesture to identify objects and their attributes. This corpus is used to learn language and gesture models that enable our system to identify the objects a user refers to. We demonstrate that combining multiple communication modalities is more effective for understanding user intent than relying on any single type of input, and discuss the implications of these results for developing natural interfaces for interacting with physical agents.
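To make the multimodal idea concrete, the sketch below shows one simple way to combine evidence from a language model and a gesture (pointing) model when resolving a referring expression. This is a minimal illustration, not the system described in this work: the toy word-overlap language score, the per-object pointing distribution, and the convex weight alpha are all illustrative assumptions.

```python
# Minimal sketch of late fusion of language and gesture evidence for
# object reference resolution. All names and scoring functions here are
# hypothetical placeholders, not the paper's actual models.

from dataclasses import dataclass


@dataclass
class Observation:
    utterance: str          # e.g. "the red mug"
    pointing_target: dict   # per-object probability from a gesture model


def language_score(utterance: str, obj_label: str) -> float:
    """Toy stand-in for a learned language grounding model:
    score a candidate object by word overlap with its label."""
    words = set(utterance.lower().split())
    return len(words & set(obj_label.lower().split())) / max(len(words), 1)


def fused_scores(obs: Observation, candidates: list, alpha: float = 0.5) -> dict:
    """Combine language and gesture evidence with a convex weight alpha."""
    return {
        obj: alpha * language_score(obs.utterance, obj)
        + (1 - alpha) * obs.pointing_target.get(obj, 0.0)
        for obj in candidates
    }


if __name__ == "__main__":
    objects = ["red mug", "blue mug", "green bowl"]
    obs = Observation(
        utterance="the red mug",
        pointing_target={"red mug": 0.4, "blue mug": 0.5, "green bowl": 0.1},
    )
    scores = fused_scores(obs, objects)
    # Language disambiguates an ambiguous pointing gesture: "red mug" wins.
    print(max(scores, key=scores.get))
```

Even in this toy setting, the fused score resolves a reference that neither modality pins down on its own, which mirrors the intuition behind combining modalities rather than depending on a single input channel.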
