BlackjackBench: Portable Hardware Characterization with Automated Results' Analysis

DARPA’s AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, together with statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We target hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and NUMA characteristics of the system; properties of the processing cores that affect instruction execution speed; and the length of the OS scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features can interfere with each other and lead to incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation. Finally, we show that the effective hardware metrics obtained from our probes allow guided tuning of computational kernels, and that the resulting kernels outperform an autotuning library further tuned by the hardware vendor.
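To make the idea of an "effective" memory-hierarchy measurement concrete, the following is a minimal sketch of a pointer-chasing latency probe written in standard C. It is not taken from BlackjackBench; the function names (`build_cycle`, `chase`, `ns_per_access`) and the choice of `clock()` as a timer are assumptions made for illustration. The probe walks a randomly ordered cyclic chain of indices so that each load depends on the previous one, and reports the average time per access for a given working-set size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[0..n-1] with a random permutation that forms a single cycle
   (Sattolo's algorithm). Following next[] from index 0 visits every
   element once before returning to 0. The random order is intended to
   make the access pattern hard for hardware prefetchers to predict.
   Note: rand() is used for brevity; its limited range restricts how well
   very large arrays are shuffled, but the single-cycle property holds. */
static void build_cycle(size_t *next, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j is in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Perform `steps` dependent loads starting at index 0 and return the
   final index, so the compiler cannot discard the loop. */
static size_t chase(const size_t *next, size_t steps)
{
    size_t p = 0;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average nanoseconds per dependent access over a working set of
   `bytes` bytes. Returns a negative value if the buffer cannot be
   created. Uses clock() (CPU time) to stay within standard C. */
static double ns_per_access(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(size_t);
    if (n == 0 || steps == 0)
        return -1.0;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_cycle(next, n);

    volatile size_t sink = chase(next, n);   /* warm-up pass */
    clock_t t0 = clock();
    sink = chase(next, steps);               /* timed pass */
    clock_t t1 = clock();
    (void)sink;

    free(next);
    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return seconds * 1e9 / (double)steps;
}
```

Calling `ns_per_access` for a range of working-set sizes (for example, doubling from a few kilobytes to several hundred megabytes) and looking for jumps in the reported time is one common way to estimate where each cache level ends. A single measurement is noisy, so in practice each size would be measured several times and summarized with a robust statistic such as the minimum or median.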
