Depth from motion for smartphone AR

Augmented reality (AR) for smartphones has matured from a technology for early adopters, available only on select high-end phones, to one that is truly available to the general public. One of the key breakthroughs has been in low-compute methods for six-degree-of-freedom (6DoF) tracking on phones using only the existing hardware (camera and inertial sensors). 6DoF tracking is the cornerstone of smartphone AR, allowing virtual content to be precisely locked on top of the real world. However, to give users the impression of believable AR, one also requires mobile depth. Without depth, even simple effects, such as a virtual object being correctly occluded by the real world, are impossible. Requiring a dedicated mobile depth sensor, though, would severely restrict access to such features. In this article, we provide a novel pipeline for mobile depth that supports a wide array of mobile phones and uses only the existing monocular color sensor. Through several technical contributions, we provide the ability to compute low-latency dense depth maps using only a single CPU core on a wide range of mid-range to high-end mobile phones. We demonstrate the capabilities of our approach on high-level AR applications, including real-time navigation and shopping.
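
To illustrate why occlusion depends on per-pixel depth, the following is a minimal sketch of a depth test between a rendered virtual layer and a dense depth map of the real scene. It is not the pipeline described in this article. The function name, the nested-list image layout, and the use of `None` for "no virtual content" or "unknown real depth" are all assumptions made for illustration.

```python
def composite_with_occlusion(camera_rgb, real_depth, virtual_rgb, virtual_depth):
    """Composite a virtual layer over a camera image with a per-pixel depth test.

    All four arguments are hypothetical same-sized 2D lists (rows of pixels).
    Depths are distances from the camera in the same unit (e.g. meters).
    virtual_depth[y][x] is None where no virtual content was rendered.
    real_depth[y][x] is None where the real depth is unknown; in this sketch
    an unknown real depth never occludes the virtual content.
    """
    output = []
    for cam_row, rd_row, virt_row, vd_row in zip(
        camera_rgb, real_depth, virtual_rgb, virtual_depth
    ):
        out_row = []
        for cam_px, rd, virt_px, vd in zip(cam_row, rd_row, virt_row, vd_row):
            # The virtual pixel is drawn only if it exists and lies in front
            # of the real surface seen through the same pixel.
            visible = vd is not None and (rd is None or vd < rd)
            out_row.append(virt_px if visible else cam_px)
        output.append(out_row)
    return output


# A 1x3 example: a virtual object at 2 m, with a real surface at 1 m
# (in front of it), at 3 m (behind it), and no virtual content at all.
camera = [["C", "C", "C"]]
real = [[1.0, 3.0, None]]
virtual = [["V", "V", "V"]]
virtual_z = [[2.0, 2.0, None]]
result = composite_with_occlusion(camera, real, virtual, virtual_z)
print(result)  # [['C', 'V', 'C']]
```

Without the `real_depth` input, the comparison in the inner loop cannot be made, and the virtual object would always be drawn on top of the camera image, which is the failure case the abstract describes.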
