Enhanced visual scene understanding through human-robot dialog

We propose a novel human-robot interaction framework for robust visual scene understanding. Without any a priori knowledge about the objects, the robot's task is to correctly count the objects in the scene and segment them from the background. Our approach builds on state-of-the-art computer vision methods that generate object hypotheses through segmentation. This process is combined with a natural dialog system that puts a human in the loop: through natural conversation, the robot gains knowledge about ambiguous situations. We present an entropy-based system that allows the robot to detect the poorest object hypotheses and query the user for arbitration. Based on the information obtained from the human-robot dialog, the scene segmentation can be re-seeded and thereby improved. Experimental results on real data show improved segmentation performance compared to segmentation without interaction.
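The idea of selecting the poorest object hypothesis by entropy can be illustrated with a minimal sketch. The abstract does not specify the entropy measure, so everything below is an assumption made for illustration: each hypothesis is represented as a list of per-pixel probability distributions over (object, background), its uncertainty is taken as the mean Shannon entropy of those distributions, and the hypothesis with the highest value is the one the robot would ask the user about. The function names and the toy data are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def hypothesis_uncertainty(pixel_label_probs):
    """Mean per-pixel entropy of one object hypothesis (assumed measure)."""
    total = sum(shannon_entropy(p) for p in pixel_label_probs)
    return total / len(pixel_label_probs)


def pick_hypothesis_to_query(hypotheses):
    """Return the id of the hypothesis with the highest mean entropy."""
    return max(hypotheses, key=lambda h: hypothesis_uncertainty(hypotheses[h]))


# Toy data (hypothetical): per-pixel (object, background) probabilities.
hypotheses = {
    "object_1": [(0.95, 0.05), (0.90, 0.10)],  # confident segmentation
    "object_2": [(0.55, 0.45), (0.50, 0.50)],  # ambiguous segmentation
}

# The most ambiguous hypothesis is the candidate for a clarifying question.
worst = pick_hypothesis_to_query(hypotheses)
print(worst)
```

In this toy setup the ambiguous hypothesis is selected for the query; in a full system, the user's answer would then be used to re-seed the segmentation, as the abstract describes.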
