Paper Information - Performance Comparison of CGRA and Mobile GPU for Light-Field Image Processing

Recently, many approaches have applied light-field image processing on smartphones and wearable devices. A Graphics Processing Unit (GPU) is commonly used to exploit the parallelism in such image processing. However, the access pattern of a light-field application is sparser than that of typical stencil applications and does not use all the data in a cache line. Furthermore, data requests to multiple locations generate an enormous number of short-burst memory transfers in the cache system, incur high latency, and do not fully utilize the high memory bandwidth of the GPU. Therefore, an alternative architecture that exploits long-burst data transmission, which improves memory bandwidth utilization, is essential. We propose a sparse-stencil-oriented Coarse-Grained Reconfigurable Accelerator (CGRA) that we call EMAXV. Unlike the on-demand loading of multiple data items on a GPU, EMAXV loads the input data with long-burst transfers before execution proceeds, concealing the sparse memory accesses and multi-threading cache races. It further hides the memory-loading latency behind the execution latency of other activations. We evaluated the performance of EMAXV and a mobile GPU (Tegra K1) with identical host CPU frequency and main memory bandwidth. Although EMAXV has much lower computation capability, it achieved four times the performance of the mobile GPU for light-field depth extraction and 89% of its performance for light-field image rendering.
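The cache-line argument in the abstract can be illustrated with a toy model. The sketch below is not taken from the paper. It assumes a row-major image, 4-byte pixels, a 64-byte cache line, and a hypothetical sparse stencil whose taps are 16 pixels apart; all of these values, and the function name `line_utilization`, are illustrative assumptions. It counts how many cache lines one stencil evaluation touches and what fraction of the fetched bytes it actually reads, ignoring any reuse across neighboring threads.

```python
def line_utilization(offsets, width, x, y, elem_bytes=4, line_bytes=64):
    """Return (utilization, lines_touched) for one stencil evaluation.

    utilization is the fraction of fetched cache-line bytes that are read.
    All sizes are illustrative assumptions, not measured hardware values.
    """
    # Byte address of every pixel the stencil reads (row-major layout).
    addrs = {((y + dy) * width + (x + dx)) * elem_bytes for dx, dy in offsets}
    # Each distinct cache line corresponds to one short-burst transfer.
    lines = {addr // line_bytes for addr in addrs}
    used = len(addrs) * elem_bytes
    fetched = len(lines) * line_bytes
    return used / fetched, len(lines)


WIDTH = 1024   # hypothetical image width in pixels
STRIDE = 16    # hypothetical spacing between sparse-stencil taps

# Dense 3x3 stencil: neighbors are adjacent pixels.
dense = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
# Sparse 3x3 stencil: same tap count, but taps are STRIDE pixels apart.
sparse = [(STRIDE * dx, STRIDE * dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

dense_util, dense_lines = line_utilization(dense, WIDTH, x=520, y=512)
sparse_util, sparse_lines = line_utilization(sparse, WIDTH, x=520, y=512)

print(f"dense : {dense_lines} lines, {dense_util:.4f} of fetched bytes used")
print(f"sparse: {sparse_lines} lines, {sparse_util:.4f} of fetched bytes used")
```

Under these assumptions the dense stencil touches 3 cache lines and the sparse one touches 9 for the same nine taps, so the sparse pattern needs three times as many short transfers and uses a third as much of each fetched line. Real GPU behavior also depends on how threads are mapped to pixels and on reuse between neighboring threads, which this sketch deliberately leaves out.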

Yasuhiko Nakashima | Yuttakon Yuttakonkit
