Fast data-locality profiling of native execution

Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insight and flexibility and can model many different cache configurations, but have the drawback of large run-time overhead.

We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view.

We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
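To illustrate how sparse access statistics can yield a miss-ratio estimate, the following minimal sketch samples reuse distances from a memory trace and solves a fixed-point equation for the miss ratio of a cache with random replacement. The abstract does not give the model's equations or say how the monitor process samples accesses. The random-replacement assumption, the in-memory trace standing in for the monitored application, and all function and parameter names are therefore hypothetical and chosen for illustration only.

```python
import random


def sample_reuse_distances(trace, sample_rate, line_size=64, seed=0):
    """Sparsely sample a trace of byte addresses (hypothetical helper).

    For each sampled access, record how many other accesses occur before
    the same cache line is touched again. Sampled lines that are never
    touched again are counted as dangling.
    """
    rng = random.Random(seed)
    watched = {}    # cache line -> index of the sampled access
    distances = []  # completed reuse distances
    for i, addr in enumerate(trace):
        line = addr // line_size
        if line in watched:
            # Number of accesses strictly between the two touches.
            distances.append(i - watched.pop(line) - 1)
        if rng.random() < sample_rate:
            watched[line] = i
    dangling = len(watched)  # samples with no observed reuse
    return distances, dangling


def estimate_miss_ratio(distances, dangling, cache_lines, iterations=200):
    """Estimate the miss ratio of a random-replacement cache (assumed model).

    With miss ratio r, about d * r replacements happen during a reuse
    distance d, and a line survives each one with probability
    1 - 1/cache_lines. The miss ratio is the fixed point of
        r = (dangling + sum(1 - keep ** (d * r))) / n.
    """
    n = len(distances) + dangling
    if n == 0:
        return 0.0
    keep = 1.0 - 1.0 / cache_lines
    r = 1.0  # start from the upper bound and iterate downward
    for _ in range(iterations):
        misses = dangling + sum(1.0 - keep ** (d * r) for d in distances)
        r = misses / n
    return r


if __name__ == "__main__":
    # Small working set (8 lines) swept repeatedly.
    trace = [(i % 8) * 64 for i in range(8000)]
    dists, dang = sample_reuse_distances(trace, sample_rate=0.05)
    for lines in (4, 64, 1024):
        print(lines, round(estimate_miss_ratio(dists, dang, lines), 4))
```

Because the sampled distances do not depend on the cache size, the same samples can be evaluated against several values of `cache_lines` without rerunning the application, which is the kind of flexibility the abstract attributes to the statistical approach.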
