Multitask Deep Neural Networks for Tele-Wide Stereo Matching

In this article, we propose deep learning solutions for estimating the real-world depth of elements in a scene captured by two cameras with different fields of view (FOVs). We consider a realistic smartphone scenario in which the first FOV is a Wide FOV at $1\times$ optical zoom, and the second FOV is contained in the first and is captured by a tele zoom lens at $2\times$ optical zoom. We refer to the problem of estimating the depth for all elements in the union of the FOVs, which corresponds to the Wide FOV, as 'tele-wide stereo matching'. Traditional stereo matching algorithms can only estimate the disparity or depth in the overlapped FOV, which corresponds to the Tele FOV. To benchmark this novel problem, we introduce a single-image inverse-depth estimation (SIDE) solution that estimates the disparity from the Wide FOV image alone. We also design a novel multitask tele-wide stereo matching deep neural network (MT-TW-SMNet), which is the first to combine the stereo matching and single-image depth tasks in one network. Moreover, we propose multiple methods for fusing the above networks. One is input feature fusion, which uses the disparity estimated by stereo matching as an additional input feature for SIDE. Another is decision fusion, which deploys a stacked hourglass (SHG) network to fuse and refine the disparity maps from both the SIDE network and the MT-TW-SMNet. These fusion schemes significantly improve the accuracy. Experimental results on the KITTI and SceneFlow datasets demonstrate that our proposed approaches provide a reasonable solution to the tele-wide stereo matching problem. We demonstrate the effectiveness of our solutions by generating the Bokeh effect on the full Wide FOV.
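To make the geometry and the input feature fusion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Tele FOV is an exactly centered crop of the Wide FOV (aligned optical axes, no lens distortion), so that a $2\times$ zoom covers the central half of the width and height. It also assumes that fusion is done by stacking a zero-padded stereo disparity map and a validity mask as extra input channels. The function names `tele_region` and `fuse_input_features` are hypothetical.

```python
import numpy as np


def tele_region(wide_h, wide_w, zoom=2.0):
    """Return (top, left, height, width) of the tele crop inside the wide image.

    Assumption: the tele view is a perfectly centered crop of the wide view.
    """
    h = int(round(wide_h / zoom))
    w = int(round(wide_w / zoom))
    top = (wide_h - h) // 2
    left = (wide_w - w) // 2
    return top, left, h, w


def fuse_input_features(wide_rgb, tele_disparity, zoom=2.0):
    """Stack RGB, zero-padded stereo disparity, and a validity mask.

    wide_rgb:       (H, W, 3) wide-FOV image.
    tele_disparity: (H/zoom, W/zoom) disparity from stereo matching,
                    already resampled to wide-image pixel units (assumption).
    Returns an (H, W, 5) float32 array usable as input to a SIDE-style network.
    """
    height, width, _ = wide_rgb.shape
    top, left, h, w = tele_region(height, width, zoom)
    if tele_disparity.shape != (h, w):
        raise ValueError(f"expected disparity of shape {(h, w)}")

    disparity = np.zeros((height, width), dtype=np.float32)
    valid = np.zeros((height, width), dtype=np.float32)
    disparity[top:top + h, left:left + w] = tele_disparity
    valid[top:top + h, left:left + w] = 1.0  # marks where stereo disparity exists

    return np.concatenate(
        [wide_rgb.astype(np.float32), disparity[..., None], valid[..., None]],
        axis=-1,
    )


if __name__ == "__main__":
    rgb = np.random.rand(8, 12, 3)
    stereo_disp = np.full((4, 6), 7.5, dtype=np.float32)
    fused = fuse_input_features(rgb, stereo_disp)
    print(fused.shape)
```

The validity mask lets a downstream network distinguish "no stereo estimate" from "zero disparity" in the region outside the Tele FOV, which is the part of the Wide FOV where only single-image cues are available.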
