Pitch and Roll Camera Orientation from a Single 2D Image Using Convolutional Neural Networks

In this paper, we propose using convolutional neural networks (CNNs) to automatically estimate the pitch and roll of a camera from a single, scene-agnostic 2D image. We compare a linear regressor, a two-layer neural network, and two CNNs, and show that the CNNs estimate the ground-truth orientations with high accuracy. Such estimates can be used in computer vision tasks where the camera orientation is necessary or useful. Using the accelerometer data in an existing image dataset, we derive the large set of approximately correct orientation labels needed to train such a network. The trained network is then fine-tuned on smaller datasets with exact camera orientation labels. Additionally, the network is fine-tuned on a dataset with different intrinsic camera parameters to demonstrate its transferability.
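To illustrate how accelerometer data can yield approximate pitch and roll labels, the following is a minimal sketch of the common tilt-from-gravity formula. It is not taken from the paper: the function name `pitch_roll_from_gravity`, the axis convention (x forward, y right, z down, with a level, static sensor reading (0, 0, g)), and the sign conventions are all assumptions made for the example.

```python
import math


def pitch_roll_from_gravity(ax, ay, az):
    """Return (pitch, roll) in degrees from one static accelerometer reading.

    Assumed convention (hypothetical, not from the paper): x forward,
    y right, z down, so a level sensor at rest reads roughly (0, 0, g).
    Only the direction of the vector matters, so units are arbitrary.
    """
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    if norm == 0.0:
        raise ValueError("accelerometer reading has zero magnitude")

    # Roll: rotation about the forward (x) axis, read from the y/z components.
    roll = math.atan2(ay, az)
    # Pitch: rotation about the right (y) axis, read from x against the y/z magnitude.
    pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))

    return math.degrees(pitch), math.degrees(roll)
```

Labels obtained this way are only approximate, since the sensor also measures any motion of the device and may be miscalibrated, which is consistent with the abstract's use of smaller, exactly labeled datasets for fine-tuning.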
