Performance Comparison of CGRA and Mobile GPU for Light-Field Image Processing

Recently, many approaches have applied light-field image processing on smartphones and wearable devices. A Graphics Processing Unit (GPU) is commonly used to exploit the parallelism in such image processing. However, the access pattern in light-field applications is sparser than that of typical stencil applications and does not use all the data in a cache line. Furthermore, data requests to multiple locations generate an enormous number of short-burst memory transfers in the cache system, incur high latency, and do not fully utilize the high memory bandwidth of the GPU. Therefore, an alternative architecture that exploits long-burst data transfers, which improve memory bandwidth utilization, is essential. We propose a sparse-stencil-oriented Coarse-Grained Reconfigurable Accelerator (CGRA) that we call EMAXV. Unlike the on-demand loading of multiple data items on a GPU, EMAXV loads the input data with long-burst transfers before execution proceeds, which hides the sparse memory accesses and avoids cache races among threads. It further hides memory-loading latency by overlapping it with the execution latency of other activations. We evaluated the performance of EMAXV and a mobile GPU (Tegra K1) with identical host CPU frequency and main memory bandwidth. Although EMAXV has much lower computation capability, it achieved four times the performance of the mobile GPU for light-field depth extraction and 89% of its performance for light-field image rendering.