Fusing Inertial Data with Vision for Enhanced Image Understanding

In this paper we show that combining knowledge of a camera's orientation with visual information improves the performance of semantic image segmentation. The approach rests on the assumption that the direction in which a camera is facing acts as a prior on the content of the images it captures. We gathered egocentric video with a camera attached to a head-mounted display and recorded its orientation using an inertial sensor. By combining orientation information with typical image descriptors, we show that segmentation accuracy on individual images improves from 61% with vision alone to 71% over six classes. We also show that the method can be applied to both point-based and line-based image features, and that these can be combined for further benefit. The resulting system has potential applications in autonomous robot locomotion and in guiding visually impaired people.
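
The idea of treating camera orientation as a prior can be illustrated with a minimal sketch of feature-level fusion. The abstract does not specify the descriptors, the classifier, or the fusion scheme, so everything below is an assumption made for illustration: the function names, the pitch/roll encoding, the patch row position, the nearest-centroid classifier, and the toy values are all hypothetical. The sketch shows two patches that look identical to a vision-only descriptor becoming separable once orientation is appended.

```python
import numpy as np

def fuse_features(descriptor, pitch, roll, row_frac):
    """Append orientation cues to a visual descriptor (hypothetical encoding).

    descriptor: 1-D array of visual features for one image patch.
    pitch, roll: camera orientation in radians, e.g. from an inertial sensor.
    row_frac: vertical position of the patch, 0 (top) to 1 (bottom).
    """
    orientation = np.array([np.sin(pitch), np.cos(pitch),
                            np.sin(roll), np.cos(roll), row_frac])
    return np.concatenate([np.asarray(descriptor, dtype=float), orientation])

def fit_centroids(X, y):
    """Compute one mean feature vector per class (stand-in classifier)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each row of X to the class with the nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Toy data (made up): identical visual descriptors, different camera pitch.
# Label 0 stands for a class seen when looking down, label 1 when looking up.
visual = np.array([[0.5, 0.5]] * 4)
pitch = np.array([-0.8, -0.7, 0.7, 0.8])
labels = np.array([0, 0, 1, 1])

# Vision alone: the classes are indistinguishable.
cls_v, cen_v = fit_centroids(visual, labels)
pred_vision = predict(visual, cls_v, cen_v)

# Vision plus orientation: the pitch terms separate the classes.
fused = np.stack([fuse_features(v, p, 0.0, 0.5) for v, p in zip(visual, pitch)])
cls_f, cen_f = fit_centroids(fused, labels)
pred_fused = predict(fused, cls_f, cen_f)
```

Encoding angles as sine and cosine pairs avoids the discontinuity at the wrap-around point; a real system would likely use a stronger classifier and a spatial model over neighbouring patches.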
