Detailed cache simulation for detecting bottleneck, miss reason and optimization potentialities

Cache locality optimization is an effective way to reduce the time modern processors spend idle while waiting for data. It can be achieved at the code level, by programmers or compilers, or at the system level through appropriate schemes such as reconfigurable cache organization and suitable prefetching or replacement strategies. For the former, users need to know the problem, its cause, and the solution; for the latter, a platform is required for evaluating proposed and novel approaches. As existing simulation systems do not provide such information and platforms, we implemented a cache simulator that models the complete cache hierarchy and associated techniques. More specifically, it analyzes the characteristics of cache misses and provides information about runtime accesses to data structures and about cache access behavior. Together with a visualization tool, this information enables the user to detect access hotspots and to identify optimization strategies for tackling them. To support the study of different cache configuration and management techniques, the simulator models a variety of cache line replacement and prefetching policies and allows the user to specify any cache organization, including cache size, cache set size, block size, and associativity. The simulator hence forms a research platform for investigating the influence of these techniques on the execution behavior of applications.
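To make the configurable parameters concrete, the following is a minimal sketch of a single-level, set-associative cache model. It is not the simulator described above: the class name `SetAssociativeCache`, its parameters, the choice of LRU replacement, and the coarse split into cold misses and all other misses are assumptions made for illustration only. Prefetching and multi-level hierarchies are left out.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Hypothetical single-level cache model with LRU replacement."""

    def __init__(self, cache_size, block_size, associativity):
        # All sizes are in bytes; the cache must divide evenly into sets.
        assert cache_size % (block_size * associativity) == 0
        self.block_size = block_size
        self.associativity = associativity
        self.num_sets = cache_size // (block_size * associativity)
        # One OrderedDict per set; insertion order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.seen_blocks = set()   # blocks referenced at least once
        self.stats = {"hit": 0, "cold_miss": 0, "other_miss": 0}
        self.misses_by_name = {}   # misses attributed to a data structure name

    def access(self, address, name="?"):
        """Simulate one memory reference and return its classification."""
        block = address // self.block_size
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)   # mark as most recently used
            self.stats["hit"] += 1
            return "hit"
        # Assumption: a first reference is a cold miss; anything else is
        # lumped together (capacity or conflict) in this sketch.
        kind = "other_miss" if block in self.seen_blocks else "cold_miss"
        self.seen_blocks.add(block)
        if len(lines) >= self.associativity:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = True
        self.stats[kind] += 1
        self.misses_by_name[name] = self.misses_by_name.get(name, 0) + 1
        return kind


if __name__ == "__main__":
    # 128-byte cache, 16-byte blocks, 2-way: four sets.
    cache = SetAssociativeCache(cache_size=128, block_size=16, associativity=2)
    for addr in (0, 4, 64, 128, 0):
        print(addr, cache.access(addr, name="A"))
    print(cache.stats, cache.misses_by_name)
```

Attributing each miss to a data structure name, as `misses_by_name` does here, is one simple way to surface the kind of per-structure access information that a visualization tool could use to highlight hotspots.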
