Servet: A benchmark suite for autotuning on multicore clusters

The growing complexity of computer system hierarchies, driven by the increasing number of cores per processor, the number of cache levels (some of them shared), and the number of processors per node, as well as by high-speed interconnects, demands new optimization techniques and libraries that take advantage of these features. This paper presents Servet, a suite of benchmarks designed to detect a set of parameters with a strong influence on the overall performance of multicore systems. These benchmarks determine the cache hierarchy, including cache sizes and which caches are shared by which cores, memory access bandwidths and bottlenecks, and communication latencies among cores. Auto-tuned codes can use these parameters to increase their performance on multicore clusters. Experimental results on several representative systems show that Servet provides very accurate estimates of the machine architecture parameters.
