Automatic tuning of sparse matrix-vector multiplication on multicore clusters

To achieve good performance and scalability on multicore clusters, parallel applications must be carefully optimized to exploit intra-node parallelism and to reduce inter-node communication. This paper investigates automatic tuning of a sparse matrix-vector multiplication (SpMV) kernel implemented in UPC, a partitioned global address space (PGAS) language whose communication layer supports hybrid thread- and process-based execution on multicore systems. Inter-node data exchange uses one-sided communication, while intra-node communication is handled through process shared memory (PSHM) and multithreading. We develop performance models that guide the selection of the best mix of threads and processes as well as the best communication pattern for SpMV. The tuned SpMV kernel running in this hybrid runtime environment consumes less memory and reduces inter-node communication volume while preserving data locality. Experiments on 12 real sparse matrices show that, on a 16-node Xeon cluster and an 8-node Opteron cluster, the tuned kernel outperforms a well-optimized process-based MPI implementation by 1.4X and 1.5X on average, respectively.
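To make the partitioning concrete, the following is a minimal sketch in plain C (not the authors' UPC code) of the per-process work in a row-partitioned SpMV: each process or thread multiplies its local block of CSR rows against the input vector x, assuming the remote x entries it needs have already been gathered. All names here (csr_block_t, spmv_local) are illustrative, not taken from the paper; in the paper's UPC implementation the gather of remote x entries would use one-sided communication, and x entries owned by other processes on the same node can be read through shared memory.

/*
 * Minimal sketch of the per-process work in a row-partitioned SpMV,
 * written in plain C for illustration (the paper's kernel is in UPC).
 * Function and type names are hypothetical, not from the paper.
 */
#include <stdio.h>

/* Local block of rows in compressed sparse row (CSR) format. */
typedef struct {
    int     nrows;      /* number of rows owned by this process/thread */
    int    *rowptr;     /* nrows+1 offsets into colidx/val             */
    int    *colidx;     /* global column indices                       */
    double *val;        /* nonzero values                              */
} csr_block_t;

/*
 * y[i] = sum_k A->val[k] * x[A->colidx[k]] over the locally owned rows.
 * In the hybrid runtime described in the paper, x entries that live on
 * other nodes would be fetched beforehand with one-sided communication,
 * while entries owned by other processes on the same node are read
 * directly through process shared memory.
 */
static void spmv_local(const csr_block_t *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->val[k] * x[A->colidx[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example block:  [ 2 0 1 ]
                           [ 0 3 0 ]
                           [ 4 0 5 ] */
    int    rowptr[] = {0, 2, 3, 5};
    int    colidx[] = {0, 2, 1, 0, 2};
    double val[]    = {2, 1, 3, 4, 5};
    csr_block_t A   = {3, rowptr, colidx, val};

    double x[] = {1.0, 1.0, 1.0};  /* assembled from local + remote parts */
    double y[3];

    spmv_local(&A, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);  /* expect [3 3 9] */
    return 0;
}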