Learning to Track Object Position through Occlusion

Occlusion is one of the most significant challenges encountered by object detectors and trackers. While both object detection and tracking have received a lot of attention in the past, most existing methods in this domain do not target detecting or tracking objects when they are occluded. Yet detecting or tracking an object of interest through occlusion remains a long-standing challenge for many autonomous tasks. Traditional methods that employ visual object trackers with explicit occlusion modeling experience drift and make several fundamental assumptions about the data. We propose to address this with a ‘tracking-by-detection’ approach that builds upon the success of region-based video object detectors. Our video-level object detector uses a novel recurrent computational unit at its core that enables long-term propagation of object features even under occlusion. Finally, we compare our approach with existing state-of-the-art video object detectors and show that it achieves superior results on a dataset of furniture assembly videos collected from the internet, where small objects like screws, nuts, and bolts often get occluded from the camera viewpoint.
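The idea of carrying object features across occluded frames can be illustrated with a toy gated memory update. This is a minimal sketch only and not the recurrent unit proposed in the paper, whose details the abstract does not give. The function names `gated_update` and `propagate`, the per-frame feature vectors, and the scalar `gate` (a hypothetical visibility score in [0, 1]) are all assumptions made for illustration.

```python
from typing import List, Sequence


def gated_update(memory: Sequence[float],
                 features: Sequence[float],
                 gate: float) -> List[float]:
    """Blend current-frame features into the memory.

    gate = 1.0 overwrites the memory with the current features;
    gate = 0.0 (hypothetical "fully occluded" case) keeps the memory unchanged.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [(1.0 - gate) * m + gate * f for m, f in zip(memory, features)]


def propagate(frames: Sequence[Sequence[float]],
              gates: Sequence[float]) -> List[List[float]]:
    """Run the gated update over a sequence of per-frame feature vectors.

    The memory is initialised from the first frame; gates[0] is unused.
    """
    memory = list(frames[0])
    history = [memory]
    for features, gate in zip(frames[1:], gates[1:]):
        memory = gated_update(memory, features, gate)
        history.append(memory)
    return history


# Toy sequence: the object is visible in frames 0 and 3, occluded in 1 and 2.
frames = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
gates = [1.0, 0.0, 0.0, 1.0]
history = propagate(frames, gates)
```

In this toy setting the memory holds the last visible features through the two occluded frames and refreshes once the object reappears. A learned recurrent unit would instead predict the gate from the data rather than receive it as an input.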
