ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings

We present ClusterVO, a stereo visual odometry system that simultaneously clusters rigid moving objects and estimates the motion of both the camera and those surrounding clusters/objects. Unlike previous solutions that rely on batch input or impose priors on scene structure or dynamic object models, ClusterVO is online and general, and can therefore be used in a variety of scenarios, including indoor scene understanding and autonomous driving. At the core of our system lie a multi-level probabilistic association mechanism and a heterogeneous Conditional Random Field (CRF) clustering approach that combines semantic, spatial and motion information to jointly infer cluster segmentations online for every frame. The poses of the camera and the dynamic objects are solved instantly through a sliding-window optimization. Our system is evaluated both quantitatively and qualitatively on the Oxford Multimotion and KITTI datasets, achieving results comparable to state-of-the-art solutions on both odometry and dynamic trajectory recovery.
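To make the idea of combining semantic, spatial and motion cues concrete, the following is a minimal toy sketch and not the ClusterVO method itself. It assumes planar (2D) rigid motions instead of full 3D poses, uses only per-landmark unary costs (a real CRF would also include pairwise terms and proper inference), and every name, weight and data value in it (`assign_clusters`, `w_motion`, `in_box`, and so on) is hypothetical.

```python
import math


def apply_rigid_2d(pose, p):
    """Apply a planar rigid motion (theta, tx, ty) to a point (x, y)."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def assign_clusters(prev_pts, curr_pts, in_box, motions, centroids,
                    w_motion=1.0, w_spatial=0.1, w_semantic=0.5):
    """Label each landmark with the cluster of lowest summed unary cost.

    Hypothetical cost terms (assumptions, not taken from the paper):
      motion   - distance between the landmark's predicted and observed position
      spatial  - distance from the observed landmark to the cluster centroid
      semantic - 0 if the landmark lies in the cluster's detection box, else 1
    """
    labels = []
    for i, (p, q) in enumerate(zip(prev_pts, curr_pts)):
        costs = []
        for k, (pose, centroid) in enumerate(zip(motions, centroids)):
            motion = math.dist(apply_rigid_2d(pose, p), q)
            spatial = math.dist(q, centroid)
            semantic = 0.0 if in_box[i][k] else 1.0
            costs.append(w_motion * motion
                         + w_spatial * spatial
                         + w_semantic * semantic)
        # Independent argmin per landmark: no pairwise smoothing is applied.
        labels.append(min(range(len(costs)), key=costs.__getitem__))
    return labels


# Cluster 0 is a static background; cluster 1 translates by one unit along x.
motions = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 0.0)]
prev_pts = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
curr_pts = [(0.0, 0.0), (1.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
# in_box[i][k] is True when landmark i falls inside cluster k's detection box.
in_box = [[True, False], [True, False], [False, True], [False, True]]

labels = assign_clusters(prev_pts, curr_pts, in_box, motions, centroids)
print(labels)  # [0, 0, 1, 1]
```

In this toy setup the two stationary landmarks are assigned to the background cluster and the two translating landmarks to the moving cluster, because all three cost terms agree; the weights only matter when the cues conflict.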
