MobiDepth: real-time depth estimation using on-device dual cameras

Real-time depth estimation is critical for the increasingly popular augmented reality and virtual reality applications on mobile devices. Yet existing solutions are insufficient: they require expensive depth sensors, depend on motion of the device, or incur high latency. We propose MobiDepth, a real-time depth estimation system that uses the widely available on-device dual cameras. Although binocular depth estimation is a mature technique, realizing it on commodity mobile devices is challenging for three reasons: the dual cameras have different focal lengths, their frame flows are unsynchronized, and the stereo-matching algorithm is computationally heavy. To address these challenges, MobiDepth integrates three novel techniques: 1) iterative field-of-view cropping, which crops the fields of view of the dual cameras to achieve equivalent focal lengths for accurate epipolar rectification; 2) heterogeneous camera synchronization, which synchronizes the frame flows captured by the dual cameras to avoid the displacement of moving objects across the frames in the same pair; 3) mobile GPU-friendly stereo matching, which effectively reduces the latency of stereo matching on a mobile GPU. We implement MobiDepth on multiple commodity mobile devices and conduct comprehensive evaluations. Experimental results show that MobiDepth achieves real-time depth estimation at 22 frames per second with a significantly reduced depth-estimation error compared with the baselines. Using MobiDepth, we further build an example application of 3D pose estimation, which significantly outperforms the state-of-the-art 3D pose-estimation method, reducing the pose-estimation latency and error by up to 57.1% and 29.5%, respectively.
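To illustrate why cropping the field of view can equalize focal lengths, the following is a minimal sketch under an ideal pinhole model. It is not MobiDepth's iterative procedure: it computes a single central crop, and it assumes both cameras output the same resolution and that their focal lengths in pixels are already known from calibration. The function names and all numbers are hypothetical. It also shows the standard triangulation relation Z = f * B / d that turns a disparity into a depth once the image pair is rectified.

```python
def central_crop_box(width, height, f_wide_px, f_tele_px):
    """Return (left, top, crop_w, crop_h) for the wide-angle image.

    Assumption: ideal pinhole cameras with equal output resolution.
    Cropping the central region by the ratio f_wide / f_tele and resizing
    it back to (width, height) gives an effective focal length of f_tele_px.
    """
    if not 0 < f_wide_px <= f_tele_px:
        raise ValueError("expected 0 < f_wide_px <= f_tele_px")
    scale = f_wide_px / f_tele_px
    crop_w = round(width * scale)
    crop_h = round(height * scale)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, crop_w, crop_h


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px


# Hypothetical values: 1920x1080 frames, focal lengths of 1500 px and 3000 px,
# and a 12 mm baseline between the two lenses.
box = central_crop_box(1920, 1080, 1500.0, 3000.0)
depth_m = depth_from_disparity(30.0, 3000.0, 0.012)
print(box, depth_m)
```

In practice the principal points, lens distortion, and relative rotation of the two cameras also matter, which is one reason a single analytic crop like this is only a starting point.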
