Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems

MPI libraries are widely used in high-performance computing applications. Yet, effectively tuning MPI collectives on large parallel systems remains an outstanding challenge: the process typically follows a trial-and-error approach and requires expert insight into the subtle interactions between the software and the underlying hardware. This paper presents an empirical approach for choosing and switching MPI communication algorithms at runtime to optimize application performance. We achieve this by first building an offline model, through microbenchmarks, of how the runtime parameters and message sizes affect the choice of MPI communication algorithm. We then apply this knowledge to automatically optimize new, unseen MPI programs. We evaluate our approach by applying it to the NPB and HPCC benchmarks on a 384-node cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, a 22.7% (up to 40.7%) performance improvement over the default setting.
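To make the runtime-switching idea concrete, the sketch below shows one plausible shape such a selector could take, assuming Open MPI as the underlying library. The thresholds and the helper select_bcast_algorithm are illustrative placeholders, not the model learned in the paper; the control variable coll_tuned_bcast_algorithm is Open MPI specific, and the sketch assumes the library exposes it as writable through the standard MPI_T tools interface.

```c
/*
 * Minimal sketch of a runtime algorithm selector for MPI_Bcast.
 * The size/process-count thresholds below are illustrative placeholders,
 * NOT the model learned in the paper. Writing the algorithm id through
 * the MPI_T interface assumes an Open MPI build that exposes the
 * "coll_tuned_bcast_algorithm" control variable as writable at runtime.
 */
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical decision rule distilled from offline microbenchmarks. */
static int select_bcast_algorithm(size_t msg_bytes, int nprocs)
{
    if (msg_bytes <= 8192)                       return 1; /* e.g. binomial tree */
    if (nprocs >= 256 && msg_bytes >= (1u << 20)) return 6; /* e.g. binary tree  */
    return 3;                                              /* e.g. pipeline      */
}

/* Write the chosen algorithm id into the library's control variable. */
static int apply_bcast_algorithm(int alg)
{
    int cidx, count;
    MPI_T_cvar_handle h;

    if (MPI_T_cvar_get_index("coll_tuned_bcast_algorithm", &cidx) != MPI_SUCCESS)
        return -1; /* cvar not exposed by this MPI library */
    MPI_T_cvar_handle_alloc(cidx, NULL, &h, &count);
    MPI_T_cvar_write(h, &alg);
    MPI_T_cvar_handle_free(&h);
    return 0;
}

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    size_t nbytes = 1 << 16;          /* 64 KiB payload for this example */
    char *buf = malloc(nbytes);

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Pick and install an algorithm before the collective is invoked. */
    apply_bcast_algorithm(select_bcast_algorithm(nbytes, nprocs));
    MPI_Bcast(buf, (int)nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```

In Open MPI, honoring a per-collective algorithm choice typically also requires enabling the dynamic-rules path (e.g. mpirun --mca coll_tuned_use_dynamic_rules 1). A production selector would consult a full rule table, learned offline per collective, message size, and communicator size, rather than the three hard-coded branches above.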
