Real-Time Pose Estimation for Event Cameras with Stacked Spatial LSTM Networks

We present a new method to estimate the 6DOF pose of an event camera based solely on the event stream. Our method first creates an event image from the list of events that occur within a very short time interval, then a Stacked Spatial LSTM Network (SP-LSTM) is used to learn and estimate the camera pose. Our SP-LSTM comprises a CNN that learns deep features from the event images and a stack of LSTM layers that learns spatial dependencies in the image feature space. We show that spatial dependencies play an important role in the pose estimation task and that the SP-LSTM can effectively learn this information. Experimental results on a public dataset show that our approach outperforms recent methods by a substantial margin. Overall, our proposed method reduces the position error by a factor of about 6 and the orientation error by a factor of about 3 compared to the state of the art. The source code and trained models will be released.
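To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch: events from a short time window are accumulated into an event image, a small CNN extracts a feature map, a stacked LSTM scans that map row by row to capture spatial dependencies, and two linear heads regress a 3D position and a 4D quaternion. The polarity-channel encoding, layer sizes, scan order, and sensor resolution (180 x 240, as on a DAVIS-style camera) are assumptions for illustration, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    def events_to_image(events, height=180, width=240):
        """Accumulate a short window of events (x, y, polarity) into a
        2-channel event image: one channel per polarity, counting events
        per pixel. The resolution is an assumption matching a DAVIS sensor."""
        img = torch.zeros(2, height, width)
        for x, y, p in events:
            img[0 if p > 0 else 1, y, x] += 1.0
        return img

    class SPLSTM(nn.Module):
        """CNN feature extractor followed by a stacked LSTM that reads the
        feature map row by row, then regresses position and orientation."""
        def __init__(self, hidden=256, lstm_layers=2):
            super().__init__()
            self.cnn = nn.Sequential(  # small VGG-style backbone (assumed sizes)
                nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            # each row of the CNN feature map becomes one LSTM time step
            self.lstm = nn.LSTM(input_size=256 * (240 // 8), hidden_size=hidden,
                                num_layers=lstm_layers, batch_first=True)
            self.fc_pos = nn.Linear(hidden, 3)  # x, y, z
            self.fc_rot = nn.Linear(hidden, 4)  # quaternion

        def forward(self, x):                    # x: (B, 2, 180, 240)
            f = self.cnn(x)                      # (B, 256, 22, 30)
            b, c, h, w = f.shape
            seq = f.permute(0, 2, 1, 3).reshape(b, h, c * w)  # rows as steps
            out, _ = self.lstm(seq)
            last = out[:, -1]                    # final hidden state
            return self.fc_pos(last), self.fc_rot(last)

    # usage sketch: pos, rot = SPLSTM()(torch.randn(1, 2, 180, 240))

Training would then minimize a combined position and orientation regression loss against ground-truth poses; the weighting between the two terms is a design choice not shown here.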
