Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline

GPUs are among the most energy-consuming components in real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth is especially taxing for battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel micro-architectural technique that, before rasterization, accurately determines whether a tile will be identical to the same tile in the preceding frame by comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power-consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and an energy reduction of 43% for the GPU/Memory system, far surpassing the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.
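The signature comparison described above can be illustrated with a small software sketch. It is not the hardware design proposed in the paper. It assumes that each tile's signature is a CRC32 (via Python's `zlib.crc32`) computed over the byte-encoded inputs that affect that tile, and that the previous frame's signatures are kept in a per-tile table. The names `tile_signature`, `SignatureBuffer`, and `select_tiles` are hypothetical and chosen only for illustration.

```python
import zlib
from typing import Dict, Iterable, List, Tuple

TileId = Tuple[int, int]  # (tile_x, tile_y); an assumed tile-addressing scheme


def tile_signature(inputs: Iterable[bytes]) -> int:
    """Fold all inputs that affect a tile into one CRC32 value.

    Assumption: 'inputs' are byte encodings of whatever determines the
    tile's final colors (e.g. primitive attributes, shader constants).
    """
    sig = 0
    for blob in inputs:
        # zlib.crc32 accepts a running value, so the CRC is built incrementally.
        sig = zlib.crc32(blob, sig)
    return sig


class SignatureBuffer:
    """Holds the previous frame's per-tile signatures (hypothetical helper)."""

    def __init__(self) -> None:
        self._prev: Dict[TileId, int] = {}

    def select_tiles(self, frame_inputs: Dict[TileId, List[bytes]]) -> List[TileId]:
        """Return the tiles whose signature differs from the previous frame.

        Tiles with a matching signature are treated as redundant and skipped,
        so they would never reach rasterization or fragment shading.
        """
        to_render: List[TileId] = []
        current: Dict[TileId, int] = {}
        for tile, inputs in frame_inputs.items():
            sig = tile_signature(inputs)
            current[tile] = sig
            if self._prev.get(tile) != sig:  # new tile or changed inputs
                to_render.append(tile)
        self._prev = current
        return to_render


if __name__ == "__main__":
    buf = SignatureBuffer()
    frame0 = {(0, 0): [b"tri-A", b"color=red"], (1, 0): [b"tri-B", b"color=blue"]}
    frame1 = {(0, 0): [b"tri-A", b"color=red"], (1, 0): [b"tri-B", b"color=green"]}
    print(buf.select_tiles(frame0))  # first frame: every tile must be rendered
    print(buf.select_tiles(frame1))  # only the tile whose inputs changed
```

In this sketch the decision is made from the tile's inputs alone, before any pixel is produced, which is what allows the later pipeline stages to be skipped entirely. A 32-bit signature can collide in principle, so a real design has to choose a signature width and input set for which a false match is acceptably unlikely.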
