Portable Techniques to Find Effective Memory Hierarchy Parameters

Application performance on modern microprocessors depends heavily on performance-related characteristics of the underlying architecture. To achieve the best performance, an application must be tuned to both the target-processor family and, in many cases, to the specific model, as memory-hierarchy parameters vary in important ways between models. Manual tuning is too inefficient to be practical; we need compilers that perform model-specific tuning automatically. To make such tuning practical, we need techniques that can automatically discern the critical performance parameters of a new computer system. While some of these parameters can be found in manuals, many of them cannot. To further complicate matters, compiler-based optimization should target the system’s behavior rather than its hardware limits. Effective cache capacities, in particular, can be smaller than the hardware limits for a number of reasons, such as sharing between cores or between instruction and data caches. Physical address mapping can also reduce the effective cache capacity. To address these challenges, we have developed a suite of portable tools that derive many of the effective parameters of the memory hierarchy. Our work builds on a long line of prior art that uses micro-benchmarks to analyze the memory system. We separate the design of a reference string that elicits a specific behavior from the analysis that interprets that behavior. We present a novel set of reference strings and a new robust approach to analyzing the results. We present experimental validation on a collection of 20 processors.
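To make the idea of a reference string concrete, the following is a minimal sketch of a generic pointer-chasing pattern of the kind micro-benchmarks commonly use. It is not one of the reference strings presented in this work. The names `build_reference_string`, `chase`, and `time_per_access` are hypothetical, and the use of Sattolo's algorithm to obtain a single-cycle permutation is an assumption made for illustration.

```python
import random
import time


def build_reference_string(n_slots, seed=0):
    """Build a chain over n_slots slots (hypothetical helper).

    Following next_index from slot 0 visits every slot exactly once
    before returning to slot 0. Sattolo's algorithm guarantees that
    the permutation is a single cycle.
    """
    rng = random.Random(seed)
    next_index = list(range(n_slots))
    for i in range(n_slots - 1, 0, -1):
        j = rng.randrange(i)  # j < i, so no slot can map to itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, n_steps):
    """Perform n_steps dependent accesses and return the final slot."""
    pos = 0
    for _ in range(n_steps):
        pos = next_index[pos]  # each access depends on the previous one
    return pos


def time_per_access(n_slots, n_steps=200_000):
    """Measurement only: average seconds per access for one footprint.

    Interpreting how this value changes as n_slots grows is left to a
    separate analysis step.
    """
    ref = build_reference_string(n_slots)
    start = time.perf_counter()
    chase(ref, n_steps)
    return (time.perf_counter() - start) / n_steps
```

In Python, interpreter overhead dominates the timing, so this sketch only illustrates the structure: one routine generates the access pattern, another measures it, and the interpretation of the measurements is kept apart. A practical tool would implement the chase loop in a compiled language such as C.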
