Automatic measurement of memory hierarchy parameters

The running time of many applications is dominated by the cost of memory operations. To optimize such applications for a given platform, it is necessary to have a detailed knowledge of the memory hierarchy parameters of that platform. In practice, this information is poorly documented if at all. Moreover, there is growing interest in self-tuning, autonomic software systems that can optimize themselves for different platforms; these systems must determine memory hierarchy parameters automatically without human intervention.One solution is to use micro-benchmarks to determine the parameters of the memory hierarchy. In this paper, we argue that existing micro-benchmarks are inadequate, and present novel micro-benchmarks for determining parameters of all levels of the memory hierarchy, including registers, all data caches and the translation look-aside buffer. We have implemented these micro-benchmarks in a tool called X-Ray that can be ported easily to new platforms. We present experimental results that show that X-Ray successfully determines memory hierarchy parameters on current platforms, and compare its accuracy with that of existing tools.

[1]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[2]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[3]  Keshav Pingali,et al.  X-Ray : Automatic Measurement of Hardware Parameters , 2004 .

[4]  R. Saavedra,et al.  Measuring Cache and TLB Performance and Their Effect on Benchmark Run Times USC-CS-93-546 , 1993 .

[5]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[6]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[7]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[8]  Juan E. Navarro,et al.  Practical, transparent operating system support for superpages , 2002, OSDI '02.

[9]  Dirk Grunwald,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[10]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[11]  Gang Ren,et al.  Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.

[12]  Jack J. Dongarra,et al.  Accurate Cache and TLB Characterization Using Hardware Counters , 2004, International Conference on Computational Science.

[13]  Jack W. Davidson,et al.  Automatic memory hierarchy characterization , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[14]  Carl Staelin,et al.  lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.

[15]  Martin Burtscher,et al.  Hybrid Load-Value Predictors , 2002, IEEE Trans. Computers.

[16]  Carl Staelin,et al.  Mhz: Anatomy of a Micro-benchmark , 1998, USENIX Annual Technical Conference.

[17]  Min Zhou,et al.  Experiences and lessons learned with a portable interface to hardware performance counters , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[18]  Alan Jay Smith,et al.  Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes , 1995, IEEE Trans. Computers.