Predicting Relative Depth between Objects from Semantic Features

Vision-and-language tasks such as Visual Relation Detection and Visual Question Answering benefit from semantic features that afford proper grounding of language. The 3D depth of objects depicted in 2D images is one such feature. However, it is very difficult to obtain accurate depth information without learning the appropriate features, which are scene dependent. The state of the art in this area consists of complex neural network models trained on stereo image data to predict per-pixel depth. Fortunately, in some tasks only the relative depth between objects is required. In this paper, the extent to which semantic features can predict coarse relative depth is investigated. The problem is cast as a classification task: geometrical features based on object bounding boxes, object labels and scene attributes are computed and used as inputs to pattern recognition models that predict relative depth, i.e. behind, in-front and neutral. The results are compared to those obtained by averaging the output of the monodepth neural network model, which represents the state of the art. An overall increase of 14% in relative-depth accuracy over the monodepth-derived results is achieved.
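To make the setup concrete, the following is a minimal sketch, not the authors' implementation, of how coarse relative depth between an object pair can be framed as a three-way classification problem. The specific geometric features (normalised areas, centroid offsets, overlap, lower-edge position) and the choice of a random-forest classifier from scikit-learn are illustrative assumptions; the paper additionally feeds object labels and scene attributes into its models.

# Illustrative sketch only: three-way relative-depth classification
# (behind / in-front / neutral) from bounding-box geometry with scikit-learn.
# The feature set and classifier below are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def geometric_features(box_a, box_b, img_w, img_h):
    """box = (xmin, ymin, xmax, ymax); returns a normalised feature vector."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Normalised box areas.
    area_a = (ax1 - ax0) * (ay1 - ay0) / (img_w * img_h)
    area_b = (bx1 - bx0) * (by1 - by0) / (img_w * img_h)
    # Centroid offsets, normalised by image size.
    dx = ((ax0 + ax1) - (bx0 + bx1)) / (2.0 * img_w)
    dy = ((ay0 + ay1) - (by0 + by1)) / (2.0 * img_h)
    # Normalised intersection-over-union as a coarse occlusion cue.
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy / (img_w * img_h)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Vertical position of the lower edge, a strong monocular depth cue.
    bottom_a = ay1 / img_h
    bottom_b = by1 / img_h
    return np.array([area_a, area_b, dx, dy, iou, bottom_a, bottom_b])

# X: feature vectors for object pairs; y: 0 = behind, 1 = in-front, 2 = neutral.
# Boxes and labels here are made-up toy values.
X = np.vstack([
    geometric_features((10, 40, 120, 200), (80, 60, 300, 230), 500, 375),
    geometric_features((200, 90, 260, 150), (30, 100, 470, 360), 500, 375),
])
y = np.array([1, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))

The monodepth baseline mentioned in the abstract can be reproduced in the same spirit by averaging the model's predicted per-pixel depth inside each bounding box and thresholding the difference between the two means to obtain behind, in-front or neutral.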
