Parallel Implementation and Optimization of Regional Ocean Modeling System (ROMS) Based on Sunway SW26010 Many-Core Processor

Ocean numerical models are steadily moving toward multiple physical processes and higher resolution, driven by the growing volume of measured ocean data and deeper research in ocean science. General computing capability can therefore no longer meet these models' needs, and more powerful hardware and parallel software are required to run them. China has made great progress in the research and development of homegrown high-performance processors, of which the Sunway SW26010 many-core processor is the most outstanding representative. This paper addresses the lag in ocean numerical model software adapted to homegrown processors and presents, for the first time, a parallel implementation and optimization of the Regional Ocean Modeling System (ROMS) on the Sunway SW26010 many-core processor. Three programming methods are used: OpenACC*, athread with Fortran, and athread with C. They are compared in terms of programming approach, workload, and execution efficiency, which offers practical guidance for programmers using Sunway SW26010 many-core processors. The evaluation measures the execution times and speedups of the model kernel and of the full ROMS under different optimizations, input datasets, and numbers of computing processing elements (CPEs). The results show that, compared with the original ROMS, the optimized hotspot program achieves a speedup of up to $3.69\times$.
