Learning Multi-Object Tracking and Segmentation From Automatic Annotations

In this work we contribute a novel pipeline to automatically generate training data, and to improve over state-of-the-art multi-object tracking and segmentation (MOTS) methods. Our proposed track mining algorithm turns raw street-level videos into high-fidelity MOTS training data, is scalable and overcomes the need of expensive and time-consuming manual annotation approaches. We leverage state-of-the-art instance segmentation results in combination with optical flow predictions, also trained on automatically harvested training data. Our second major contribution is MOTSNet - a deep learning, tracking-by-detection architecture for MOTS - deploying a novel mask-pooling layer for improved object association over time. Training MOTSNet with our automatically extracted data leads to significantly improved sMOTSA scores on the novel KITTI MOTS dataset (+1.9%/+7.5% on cars/pedestrians), and MOTSNet improves by +4.1% over previously best methods on the MOTSChallenge dataset. Our most impressive finding is that we can improve over previous best-performing works, even in complete absence of manually annotated MOTS training data.

[1]  Bastian Leibe,et al.  UnOVOST : Unsupervised Offline Video Object Segmentation and Tracking for the 2019 Unsupervised DAVIS Challenge , 2019 .

[2]  Thomas Brox,et al.  Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Trevor Darrell,et al.  BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling , 2018, ArXiv.

[4]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[5]  Ming-Hsuan Yang,et al.  UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking , 2015, Comput. Vis. Image Underst..

[6]  Amanda Berg,et al.  Semi-Automatic Annotation of Objects in Visual-Thermal Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[7]  Philip H. S. Torr,et al.  Video Instance Segmentation 2019: A Winning Approach for Combined Detection, Segmentation, Classification and Tracking. , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[8]  Andrea Simonelli,et al.  Disentangling Monocular 3D Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Yuchen Fan,et al.  Video Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Lorenzo Porzi,et al.  Seamless Scene Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andreas Geiger,et al.  MOTS: Multi-Object Tracking and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Trevor Darrell,et al.  Hierarchical Discrete Distribution Decomposition for Match Density Estimation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Michael Felsberg,et al.  A Generative Appearance Model for End-To-End Video Object Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Ning Xu,et al.  YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark , 2018, ArXiv.

[16]  Long Chen,et al.  Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification , 2018, 2018 IEEE International Conference on Multimedia and Expo (ICME).

[17]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Trevor Darrell,et al.  BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling , 2018, ArXiv.

[19]  Luc Van Gool,et al.  Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Yann LeCun,et al.  Predicting Future Instance Segmentations by Forecasting Convolutional Features , 2018, ECCV.

[21]  Luc Van Gool,et al.  The 2018 DAVIS Challenge on Video Object Segmentation , 2018, ArXiv.

[22]  K. Madhava Krishna,et al.  Beyond Pixels: Leveraging Geometry and Shape Cues for Online Multi-Object Tracking , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[23]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[24]  Bastian Leibe,et al.  Track, Then Decide: Category-Agnostic Vision-Based Multi-Object Tracking , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[25]  Lorenzo Porzi,et al.  In-place Activated BatchNorm for Memory-Optimized Training of DNNs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Luc Van Gool,et al.  Deep Extreme Cut: From Extreme Points to Object Segmentation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Bodo Rosenhahn,et al.  Fusion of Head and Full-Body Detectors for Multi-object Tracking , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[28]  Bastian Leibe,et al.  Towards Large-Scale Video Object Mining , 2018 .

[29]  Shuicheng Yan,et al.  Predicting Scene Parsing and Motion Dynamics in the Future , 2017, NIPS.

[30]  Peter Kontschieder,et al.  The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Bastian Leibe,et al.  Combined image- and world-space tracking in traffic scenes , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[32]  Luc Van Gool,et al.  The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[33]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[34]  Paul Newman,et al.  1 year, 1000 km: The Oxford RobotCar dataset , 2017, Int. J. Robotics Res..

[35]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Stefan Roth,et al.  Joint Optical Flow and Temporally Consistent Semantic Segmentation , 2016, ECCV Workshops.

[38]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[39]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Stefan Roth,et al.  MOT16: A Benchmark for Multi-Object Tracking , 2016, ArXiv.

[42]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  James M. Rehg,et al.  Multiple Hypothesis Tracking Revisited , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Stefan Roth,et al.  MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking , 2015, ArXiv.

[45]  Cordelia Schmid,et al.  EpicFlow: Edge-preserving interpolation of correspondences for optical flow , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[48]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[49]  James M. Rehg,et al.  Video Segmentation by Tracking Many Figure-Ground Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[50]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[51]  Shuhan Shen,et al.  Accurate Multiple View 3D Reconstruction Using Patch-Based Stereo for Large-Scale Scenes , 2013, IEEE Transactions on Image Processing.

[52]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Cordelia Schmid,et al.  Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..