Video Segmentation using Teacher-Student Adaptation in a Human Robot Interaction (HRI) Setting

Video segmentation is a challenging task that has many applications in robotics. Learning segmentation from few examples on-line is important for robotics in unstructured environments. The total number of objects and their variation in the real world is intractable, but for a specific task the robot deals with a small subset. Our network is taught, by a human moving a hand-held object through different poses. A novel two-stream motion and appearance ”teacher” network provides pseudo-labels. These labels are used to adapt an appearance ”student” network. Segmentation can be used to support a variety of robot vision functionality, such as grasping or affordance segmentation. We propose different variants of motion adaptation training and extensively compare against the state-of-the-art methods. We collected a carefully designed dataset in the human robot interaction (HRI) setting. We denote our dataset as (L)ow-shot (O)bject (R)ecognition, (D)etection and (S)egmentation using HRI. Our dataset contains teaching videos of different hand-held objects moving in translation, scale and rotation. It contains kitchen manipulation tasks as well, performed by humans and robots. Our proposed method outperforms the state-of-the-art on DAVIS and FBMS with 7% and 1.2% in F-measure respectively. In our more challenging LORDS-HRI dataset, our approach achieves significantly better performance with 46.7% and 24.2% relative improvement in mIoU over the baseline.

[1]  Lorenzo Rosasco,et al.  Teaching iCub to recognize objects using deep Convolutional Neural Networks , 2015, MLIS@ICML.

[2]  Sven Behnke,et al.  RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[3]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[4]  Chang-Su Kim,et al.  Primary Object Segmentation in Videos Based on Region Augmentation and Reduction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Sanja Fidler,et al.  Annotating Object Instances with a Polygon-RNN , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Bastian Leibe,et al.  Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[7]  Anton van den Hengel,et al.  Bridging Category-level and Instance-level Semantic Image Segmentation , 2016, ArXiv.

[8]  Richard Szeliski,et al.  A Database and Evaluation Methodology for Optical Flow , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[9]  Lorenzo Rosasco,et al.  Object identification from few examples by improving the invariance of a Deep Convolutional Neural Network , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[10]  Derek D. Lichti,et al.  Performance Assessment and Calibration of the Kinect 2.0 Time-of-Flight Range Camera for Use in Motion Capture Applications , 2015 .

[11]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ming-Hsuan Yang,et al.  SegFlow: Joint Learning for Video Object Segmentation and Optical Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Brian Taylor,et al.  Causal video object segmentation from persistence of occlusions , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Michal Irani,et al.  Video Segmentation by Non-Local Consensus voting , 2014, BMVC.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  Karteek Alahari,et al.  Learning Video Object Segmentation with Visual Memory , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Ellen M. Markman,et al.  Categorization and Naming in Children: Problems of Induction , 1989 .

[19]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[20]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Jitendra Malik,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence Segmentation of Moving Objects by Long Term Video Analysis , 2022 .

[22]  Matthew Richardson,et al.  Do Deep Convolutional Nets Really Need to be Deep and Convolutional? , 2016, ICLR.

[23]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Darwin G. Caldwell,et al.  AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[27]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Oliver Brock,et al.  Interactive segmentation for manipulation in unstructured environments , 2009, 2009 IEEE International Conference on Robotics and Automation.

[29]  Pieter Abbeel,et al.  BigBIRD: A large-scale 3D database of object instances , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[30]  Ce Liu,et al.  Exploring new representations and applications for motion analysis , 2009 .

[31]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[32]  Anton van den Hengel,et al.  Wider or Deeper: Revisiting the ResNet Model for Visual Recognition , 2016, Pattern Recognit..

[33]  Thomas Brox,et al.  Motion Trajectory Segmentation via Minimum Cost Multicuts , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[34]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[36]  Davide Maltoni,et al.  CORe50: a New Dataset and Benchmark for Continuous Object Recognition , 2017, CoRL.