Paper information - Efficient management of last-level caches in graphics processors for 3D scene rendering workloads


Three-dimensional (3D) scene rendering is implemented as a pipeline in graphics processing units (GPUs). Different stages of the pipeline access different types of data, including vertex, depth, stencil, render target (that is, pixel color), and texture sampler data. GPUs traditionally include small caches for vertex, render target, depth, and stencil data, as well as multi-level caches for the texture sampler units. The recent introduction of reasonably large last-level caches (LLCs) shared among these data streams, in both discrete and integrated graphics hardware architectures, has opened up new opportunities for improving 3D rendering. GPUs equipped with such large LLCs can exploit far-flung intra-stream and inter-stream reuses. However, there is no comprehensive study that helps graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams. In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC), which dynamically learns the reuse probabilities and manages the LLC of the GPU accordingly. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC eliminates up to 29.6% of LLC misses, and 13.1% on average, across the 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in LLC misses translate into a speedup of up to 18.2%, and 8.0% on average. On a 16 MB LLC, the average speedup achieved by GSPC over DRRIP improves further to 11.8%.
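For readers unfamiliar with the baseline, the following is a minimal Python sketch of two-bit re-reference interval prediction for a single cache set. It shows only the static, hit-priority form of RRIP and omits the set dueling that makes DRRIP dynamic; it is not GSPC and does not model any graphics streams. The names `RripSet` and `count_misses` and the tiny two-way set are illustrative assumptions, not taken from the paper.

```python
RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1   # 3: block predicted to be re-referenced in the distant future
RRPV_INSERT = RRPV_MAX - 1        # 2: "long" re-reference interval assigned on a fill


class RripSet:
    """One cache set managed with static two-bit RRIP (hypothetical helper class)."""

    def __init__(self, ways):
        self.tags = [None] * ways        # block tag per way, None while the way is empty
        self.rrpv = [RRPV_MAX] * ways    # re-reference prediction value per way

    def access(self, tag):
        """Return True on a hit; on a miss, fill the block and return False."""
        if tag in self.tags:
            # Hit: predict a near-immediate re-reference.
            self.rrpv[self.tags.index(tag)] = 0
            return True
        victim = self._find_victim()
        self.tags[victim] = tag
        self.rrpv[victim] = RRPV_INSERT
        return False

    def _find_victim(self):
        # Age all blocks until at least one reaches RRPV_MAX, then evict the first such way.
        while RRPV_MAX not in self.rrpv:
            self.rrpv = [value + 1 for value in self.rrpv]
        return self.rrpv.index(RRPV_MAX)


def count_misses(trace, ways):
    """Replay a list of block tags against one set and count the misses."""
    cache_set = RripSet(ways)
    return sum(not cache_set.access(tag) for tag in trace)
```

As a usage example, `count_misses(["a", "a", "s1", "s2", "s3", "a"], ways=2)` returns 4: the reused block `a` survives the scan of single-use blocks `s1` to `s3`, whereas a two-way LRU set would evict it and miss on the final access.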
