One-Shot Informed Robotic Visual Search in the Wild

We consider the task of underwater robot navigation for the purpose of collecting scientifically relevant video data for environmental monitoring. The majority of field robots that currently perform monitoring tasks in unstructured natural environments navigate by tracking a path through a pre-specified sequence of waypoints. Although this navigation method is often necessary, it is limiting because the robot does not have a model of what the scientist deems to be relevant visual observations. Thus, the robot can neither visually search for particular types of objects nor focus its attention on parts of the scene that might be more relevant than the pre-specified waypoints and viewpoints. In this paper, we propose a method that enables informed visual navigation via a learned visual similarity operator that guides the robot's visual search towards parts of the scene that look like an exemplar image, which is given by the user as a high-level specification for data collection. We propose and evaluate a weakly supervised video representation learning method that outperforms ImageNet embeddings for similarity tasks in the underwater domain. We also demonstrate the deployment of this similarity operator during informed visual navigation in collaborative environmental monitoring scenarios, in large-scale field trials where the robot and a human scientist collaboratively search for relevant visual content.
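To make the idea of a similarity operator concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes that the exemplar image and each cell of a coarse grid over the camera frame have already been mapped to embedding vectors by some learned encoder, and it uses cosine similarity as a stand-in scoring function. The names `cosine_similarity`, `similarity_map`, and `most_similar_cell`, and the toy embeddings, are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_map(
    exemplar: Sequence[float],
    grid: List[List[Sequence[float]]],
) -> List[List[float]]:
    """Score every grid cell of the frame against the exemplar embedding."""
    return [[cosine_similarity(exemplar, cell) for cell in row] for row in grid]


def most_similar_cell(sim_map: List[List[float]]) -> Tuple[int, int]:
    """Return the (row, col) of the highest-scoring cell, a candidate region to steer towards."""
    best = (0, 0)
    for r, row in enumerate(sim_map):
        for c, score in enumerate(row):
            if score > sim_map[best[0]][best[1]]:
                best = (r, c)
    return best


# Toy data (assumption): 3-dimensional embeddings on a 2x2 grid of image regions.
exemplar = [1.0, 0.0, 0.5]
grid = [
    [[0.0, 1.0, 0.0], [0.2, 0.9, 0.1]],
    [[2.0, 0.0, 1.0], [0.5, 0.5, 0.5]],
]
scores = similarity_map(exemplar, grid)
target = most_similar_cell(scores)
```

In this sketch the selected cell would serve as a heading hint for the navigation controller; how such a hint is turned into motion commands is outside its scope.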
