Paper Information - An Efficient Method for Optimizing PETSc on the Sunway TaihuLight System


High performance computing platforms bring great benefits to processing a wide range of ubiquitous computing tasks. The Sunway TaihuLight supercomputer is a novel high performance computing platform, ranked No. 1 on the TOP500 list. In this paper, we focus on how to optimize the Portable, Extensible Toolkit for Scientific Computation (PETSc) running on supercomputers. The motivation for this study is twofold: (i) PETSc is widely and frequently used in many scientific research fields, such as biology, fusion, artificial intelligence, and geosciences; and (ii) the current core PETSc code does not fully exploit the potential of the Sunway TaihuLight system, especially its powerful SW26010 processor. To make PETSc highly efficient, the central idea of our optimizations is to improve the performance of the time-consuming and frequently used computation components (e.g., the matrix and vector modules). To this end, we propose (i) accelerating kernel codes with computing processing elements (CPEs), for which a new compression format and targeted optimizations for vector and matrix operations are devised; and (ii) using more efficient memory access schemes. We have implemented our proposals and evaluated their effectiveness and efficiency on a real-world application, Structural Finite Element Analysis (SFEA). We obtain a 16x to 32x speedup on a single SW26010 processor. As an additional finding, the results also show high scalability on over 8,000 computing nodes, i.e., 532,500 cores.
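
The abstract does not describe the new compression format or the CPE kernels themselves. For orientation only, the sketch below shows the standard compressed sparse row (CSR) layout, which PETSc's default AIJ matrix type uses, together with the two kinds of loops that typically dominate such workloads: a vector update (AXPY) and a sparse matrix-vector product. The function names `vec_axpy` and `csr_spmv` are hypothetical; this is plain serial C, not PETSc or Sunway API code, and not the authors' method.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y <- alpha * x + y for dense vectors of length n. */
void vec_axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}

/*
 * Hypothetical helper: y <- A * x, with A stored in standard CSR form.
 *   row_ptr : n_rows + 1 offsets; row i owns entries row_ptr[i] .. row_ptr[i+1]-1
 *   col_idx : column index of each stored nonzero
 *   values  : value of each stored nonzero
 */
void csr_spmv(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *values, const double *x, double *y)
{
    assert(row_ptr[0] == 0); /* CSR offsets start at zero */
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            /* Indirect read of x through col_idx: the irregular memory access. */
            sum += values[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

As a usage example, the 3x3 matrix with rows (1, 0, 2), (0, 3, 0), (4, 0, 5) is stored as `values = {1, 2, 3, 4, 5}`, `col_idx = {0, 2, 1, 0, 2}`, `row_ptr = {0, 2, 3, 5}`; multiplying by `x = (1, 2, 3)` gives `y = (7, 6, 19)`. The indirect read `x[col_idx[k]]` is the kind of irregular access that makes memory access schemes matter for kernels like this one.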
