MSMD-Net: Deep Stereo Matching with Multi-scale and Multi-dimension Cost Volume

Deep end-to-end learning-based stereo matching methods have achieved great success, as witnessed by the leaderboards of different benchmark datasets (KITTI, Middlebury, ETH3D, etc.), and the cost volume representation is indispensable to that success. However, most existing work employs only a single cost volume, which cannot fully exploit the multi-scale cues in stereo matching or provide guidance for disparity refinement. Moreover, a single cost volume limits both the disparity range and the resolution of the disparity estimate. In this paper, we propose MSMD-Net (Multi-Scale and Multi-Dimension), which constructs multi-scale and multi-dimension cost volumes. At the multi-scale level, we generate four 4D combination volumes at different scales and integrate them in 3D cost aggregation to predict an initial disparity map. At the multi-dimension level, we construct a 3D warped correlation volume and use it to refine the initial disparity map with residual learning. These two cost volumes of different dimensions are complementary and together boost the performance of disparity estimation. Additionally, we propose a switch training strategy to further improve accuracy, in which we switch between two different activation functions to alleviate overfitting during pre-training. Our proposed method was evaluated on several benchmark datasets and ranked first on the KITTI 2012 leaderboard and second on the KITTI 2015 leaderboard as of June 23. The code of MSMD-Net is available at this https URL.
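To make the two kinds of volume concrete, the following NumPy sketch builds a 4D volume at one scale and a 3D warped correlation volume around an initial disparity. It is an illustration under assumptions, not the MSMD-Net implementation: the abstract does not specify how the combination volumes are formed, so plain left/right feature concatenation stands in for them, and the function names, the linear-interpolation warp, and the `radius` parameter are all hypothetical.

```python
import numpy as np

def concat_volume(left, right, max_disp):
    """4D volume of shape (2C, D, H, W) from feature maps of shape (C, H, W).

    Assumption: simple concatenation stands in for a 'combination volume'.
    """
    c, h, w = left.shape
    vol = np.zeros((2 * c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        # Left pixel x is paired with right pixel x - d; columns x < d stay zero.
        vol[:c, d, :, d:] = left[:, :, d:]
        vol[c:, d, :, d:] = right[:, :, : w - d]
    return vol

def warp_right(right, disp):
    """Warp right features into the left view by sampling at x - disp."""
    c, h, w = right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disp
    xs = np.clip(xs, 0, w - 1)           # clamp samples that fall outside the image
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0                       # linear interpolation weight
    rows = np.arange(h)[:, None]
    return right[:, rows, x0] * (1 - frac) + right[:, rows, x1] * frac

def warped_correlation_volume(left, right, disp, radius):
    """3D volume of shape (2*radius+1, H, W).

    Plane k holds the channel-averaged correlation between the left features
    and the right features warped by disp + (k - radius).
    """
    planes = []
    for r in range(-radius, radius + 1):
        warped = warp_right(right, disp + r)
        planes.append((left * warped).mean(axis=0))
    return np.stack(planes, axis=0)
```

A multi-scale variant would call `concat_volume` on feature maps extracted at several resolutions, with `max_disp` scaled accordingly. In a refinement stage, the correlation volume would typically be passed to a small network that predicts a residual to add to the initial disparity; that network is omitted here.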
