DRAM or no-DRAM? Exploring linear solver architectures for image domain warping in 28 nm CMOS

Solving large optimization problems in real time within the energy and cost budget of mobile SoCs is challenging and motivates the development of specialized hardware accelerators. We evaluate several linear solvers suitable for the least-squares problems that arise in image processing applications such as image domain warping. In particular, we estimate implementation costs in 28 nm CMOS technology, with a focus on trading on-chip memory against off-chip (DRAM) bandwidth. Our assessment reveals large differences in circuit area, throughput, and energy consumption, and it aims to provide a recommendation for selecting a suitable architecture. Our results emphasize that DRAM-free accelerators are an attractive choice in terms of power consumption and overall system complexity, even though they require more logic silicon area than accelerators that use external DRAM.
