How Incompletely Segmented Information Affects Multi-Object Tracking and Segmentation (MOTS)

In recent years, deep learning has made dramatic advances in computer vision, especially in object detection and instance semantic segmentation. Still, multi-object tracking (MOT) remains a very challenging problem. Even state-of-the-art deep learning-based object detectors only slightly improve tracking performance under tracking-by-detection, a preferred paradigm for MOT. Pixel-level information is considered more precise and more useful for improving tracking performance than conventional information, such as the foreground or background content of a bounding box. However, current state-of-the-art models for automatically annotating pixel-level information still fall far short of human expectations. We therefore explore how multi-object tracking and segmentation (MOTS) is affected when the information obtained from instance semantic segmentation is incomplete. We propose a mask-guided two-streamed augmentation learning (MGTSAL) algorithm, which can be applied to TrackR-CNN to alleviate the significant drop in MOTS performance when incompletely segmented information is encountered. We evaluate the proposed approach on the KITTI MOTS dataset, where it outperforms the baseline TrackR-CNN model in all our experimental settings. The promising experimental results and ablation study validate the effectiveness of the proposed approach.
