Paper Information - BAX: A Bundle Adjustment Accelerator With Decoupled Access/Execute Architecture for Visual Odometry

As the demand for embedded vision grows, solving large optimization problems in real time within energy and cost budgets becomes a challenge. We present BAX, a hardware accelerator for bundle adjustment (BA), which solves the least-squares problem of state estimation in visual odometry (VO). BAX consists of a frontend for control and a backend for computation. The frontend generates instructions on the fly, and the backend executes them to perform the BA algorithm. The backend adopts a decoupled access/execute (DAE) architecture, which separates the memory access unit (MAU) from the pipeline so that the MAU can prefetch vectors and matrices ahead of computation. To further reduce the latency of data reorganization, three transpose-free dataflows are proposed for matrix multiplication on the vector processing unit (VPU). In addition, a unified architecture for both forward and backward substitution is designed for the matrix decomposition in the linear solver. All data are stored in 442 kB of on-chip memory, and the local map is maintained efficiently by a hierarchical graph memory. Together, these techniques reduce the processing time by 53.9% compared with the baseline architecture. BAX is implemented on an FPGA in 32-bit floating-point precision with data normalization. It completes a full BA in about 63.44 ms at 200 MHz while consuming 1.12 W of power. BAX is $1.73\times$ and $22.38\times$ faster than the desktop and embedded CPUs, respectively, and achieves 90% of the GPU's performance at much lower power consumption.
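The idea of handling forward and backward substitution with one routine can be illustrated in software. The sketch below is not the BAX hardware design: it assumes a Cholesky factorization as the matrix decomposition (the abstract does not name one), and the function names `cholesky`, `substitute`, and `solve_spd` are hypothetical. It shows that the two substitution passes differ only in traversal order and in which index of the stored factor is read, so the backward pass needs no explicit transpose.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L * L^T (A must be symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def substitute(L, b, backward=False):
    """Solve L x = b (forward) or L^T x = b (backward) with a single loop.

    Assumption for illustration: only the lower-triangular factor L is stored.
    The backward pass reads L[j][i] in place of L[i][j], so L^T is never built.
    """
    n = len(b)
    order = range(n - 1, -1, -1) if backward else range(n)
    x = [0.0] * n
    done = []  # indices whose unknowns are already solved
    for i in order:
        acc = b[i]
        for j in done:
            coeff = L[j][i] if backward else L[i][j]
            acc -= coeff * x[j]
        x[i] = acc / L[i][i]
        done.append(i)
    return x


def solve_spd(A, b):
    """Solve A x = b via A = L L^T: forward pass for y, then backward pass for x."""
    L = cholesky(A)
    y = substitute(L, b, backward=False)
    return substitute(L, y, backward=True)


if __name__ == "__main__":
    # Small symmetric positive definite system, chosen only for illustration.
    print(solve_spd([[4.0, 2.0], [2.0, 3.0]], [2.0, 5.0]))
```

Sharing one loop body between both passes mirrors, at a very coarse level, the motivation for a unified substitution unit: the arithmetic is identical and only the addressing pattern changes.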
