A Passive Learning Sensor Architecture for Multimodal Image Labeling: An Application for Social Robots

Object detection and classification have countless applications in human–robot interaction systems and are necessary skills for autonomous robots that perform tasks in household scenarios. Despite great advances in deep learning and computer vision, social robots performing non-trivial tasks still spend most of their time finding and modeling objects. Working in real scenarios means dealing with constant environmental changes and with relatively low-quality sensor data, owing to the distance at which objects are often found. Ambient intelligence systems equipped with different sensors can also benefit from the ability to find objects, enabling them to inform humans about where those objects are located. For these applications to succeed, systems must detect objects that may potentially contain other objects while working with relatively low-resolution sensor data. A passive learning sensor architecture has been designed to take advantage of multimodal information obtained from an RGB-D camera and trained semantic language models. The main contribution of the architecture is improved sensor performance under low resolution and strong lighting variations, achieved by combining image labeling with word semantics. Each stage of the architecture is evaluated against current labeling techniques in the context of an autonomous social robot operating in an apartment. The results demonstrate that the proposed sensor architecture outperforms state-of-the-art approaches.
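
To make the combination of image labeling and word semantics concrete, the sketch below illustrates one plausible fusion step: re-ranking a classifier's candidate labels by their semantic agreement with labels already trusted in the scene. This is a minimal illustration under stated assumptions, not the paper's implementation; the classifier interface, the similarity function, the convex-combination weighting, and the toy values are all hypothetical stand-ins for a real CNN labeler and a word2vec-style embedding model.

```python
from typing import Callable, List, Tuple


def semantic_rerank(
    candidates: List[Tuple[str, float]],        # (label, classifier confidence)
    context_words: List[str],                   # labels already trusted in the scene
    similarity: Callable[[str, str], float],    # word-embedding similarity (assumed)
    alpha: float = 0.5,                         # hypothetical vision/semantics weight
) -> List[Tuple[str, float]]:
    """Re-rank candidate image labels by mixing visual confidence with
    word-level semantic agreement against trusted scene labels
    (e.g., the containing object or the room type)."""
    reranked = []
    for label, conf in candidates:
        if context_words:
            sem = sum(similarity(label, w) for w in context_words) / len(context_words)
        else:
            sem = 0.0
        # Convex combination: alpha weights the vision score, (1 - alpha) weights
        # semantic plausibility, so coherent labels can survive low-resolution
        # or badly lit input.
        reranked.append((label, alpha * conf + (1.0 - alpha) * sem))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy similarity table standing in for a real word-embedding model.
    toy = {("mug", "kitchen"): 0.8, ("keyboard", "kitchen"): 0.1}

    def sim(a: str, b: str) -> float:
        return toy.get((a, b), 0.0)

    # A blurry patch the classifier finds ambiguous: semantics promotes "mug".
    print(semantic_rerank([("keyboard", 0.55), ("mug", 0.50)], ["kitchen"], sim))
```

In this sketch the vision signal stays primary while semantic context rescues plausible labels whose visual confidence was degraded by distance or lighting, which is the effect the architecture's combination of labeling and word semantics aims for.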
