Enhanced visual scene understanding through human-robot dialog

We propose a novel human-robot interaction framework for robust visual scene understanding. Without any a priori knowledge about the objects, the robot's task is to count the objects in the scene correctly and segment them from the background. Our approach builds on state-of-the-art computer vision methods that generate object hypotheses through segmentation. We combine this process with a natural dialog system, placing a 'human in the loop': through natural conversation with an advanced dialog system, the robot gains knowledge about ambiguous situations. We present an entropy-based system that allows the robot to detect the poorest object hypotheses and query the user for arbitration. Based on the information obtained from the human-robot dialog, the scene segmentation can be re-seeded and thereby improved. Experimental results on real data show improved segmentation performance compared to segmentation without interaction.
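
The entropy-based selection of the poorest hypothesis can be illustrated with a minimal sketch. The abstract does not specify the exact entropy measure, so everything here is an assumption made for illustration: each object hypothesis is taken to carry a list of per-point label distributions, a hypothesis is scored by the mean Shannon entropy of those distributions, and the names `shannon_entropy`, `rank_hypotheses`, `select_query` and the `threshold` value are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of one discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def rank_hypotheses(point_posteriors):
    """Return (hypothesis_id, mean_entropy) pairs, most uncertain first.

    point_posteriors maps a hypothesis id to a list of per-point label
    distributions (an assumed representation, not taken from the paper).
    """
    scores = {}
    for hyp_id, dists in point_posteriors.items():
        if not dists:
            continue  # nothing to score for an empty hypothesis
        scores[hyp_id] = sum(shannon_entropy(d) for d in dists) / len(dists)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def select_query(point_posteriors, threshold=0.5):
    """Pick the hypothesis to ask the user about, or None if all are confident.

    threshold is an illustrative value in bits, not a value from the paper.
    """
    ranked = rank_hypotheses(point_posteriors)
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0]
    return None


# Toy data: hyp_a is confidently labeled, hyp_b is ambiguous.
example = {
    "hyp_a": [[0.98, 0.02], [0.95, 0.05]],
    "hyp_b": [[0.5, 0.5], [0.6, 0.4]],
}
query_target = select_query(example)
```

In this toy example the ambiguous hypothesis `hyp_b` is the one selected for a user query. A dialog component would then turn that selection into a question, and the answer would be used to re-seed the segmentation.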
