Performance and Power Efficient Massive Parallel Computational Model for HPC Heterogeneous Exascale Systems

Emerging Exascale supercomputing systems, anticipated to be available around 2020, are expected to help unravel many scientific mysteries. Such a system will deliver a thousand-fold increase in computing power over current Petascale systems. It will also move developers and researchers from conventional homogeneous architectures toward heterogeneous ones that pair energy-efficient GPU devices with traditional CPUs. Achieving ExaFLOPS performance on such Ultrascale systems confronts present technologies with several challenges. Massive parallelism is one of them, and it demands a novel, low-power parallel programming approach. This paper introduces a parallel programming model that achieves massive parallelism by combining coarse-grained parallelism across nodes with fine-grained parallelism within each node. The proposed model is a tri-hybrid of MPI, OpenMP, and the Compute Unified Device Architecture (CUDA), abbreviated MOC, that processes input data on heterogeneous hardware. We implemented the proposed model in a dense matrix multiplication application and compared the measured metrics with well-known basic linear algebra libraries, namely the CUDA Basic Linear Algebra Subroutines library (cuBLAS) and the KAUST Basic Linear Algebra Subprograms (KBLAS). MOC outperformed all implemented methods, achieving higher performance while consuming less power. The proposed MOC approach can be considered an initial, leading model for programming emerging Exascale computing systems.
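The paper's MOC source code is not reproduced here; as a rough illustration of the tri-hybrid structure described above, the following minimal sketch distributes row blocks of a dense matrix product across MPI ranks (coarse-grained, inter-node parallelism), uses OpenMP threads on the host for data preparation, and offloads each block product to a CUDA kernel (fine-grained, intra-node parallelism). The matrix order N, the naive one-thread-per-element kernel, and the build path are illustrative assumptions, not the authors' implementation; error checking is omitted for brevity.

/*
 * Minimal MPI + OpenMP + CUDA sketch of a tri-hybrid (MOC-style) dense
 * matrix multiply. Illustrative build path (assumption): compile with
 * nvcc using an MPI compiler wrapper and -Xcompiler -fopenmp.
 */
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* global matrix order; assumed divisible by the rank count */

/* Fine-grained level: one GPU thread computes one element of the C block. */
__global__ void block_gemm(const float *A, const float *B, float *C,
                           int rows, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row within block  */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column            */
    if (i < rows && j < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[i * n + k] * B[k * n + j];
        C[i * n + j] = acc;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;  /* row block owned by this rank */
    float *A = NULL, *C = NULL;
    float *B      = (float *)malloc((size_t)N * N * sizeof(float));
    float *Ablock = (float *)malloc((size_t)rows * N * sizeof(float));
    float *Cblock = (float *)malloc((size_t)rows * N * sizeof(float));

    if (rank == 0) {
        A = (float *)malloc((size_t)N * N * sizeof(float));
        C = (float *)malloc((size_t)N * N * sizeof(float));
        /* OpenMP on the host: initialize the inputs in parallel. */
        #pragma omp parallel for
        for (long i = 0; i < (long)N * N; ++i) {
            A[i] = 1.0f;
            B[i] = 2.0f;
        }
    }

    /* Coarse-grained level: scatter row blocks of A, broadcast B. */
    MPI_Scatter(A, rows * N, MPI_FLOAT, Ablock, rows * N, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Fine-grained level: offload the local block product to the GPU. */
    float *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)rows * N * sizeof(float));
    cudaMalloc(&dB, (size_t)N * N * sizeof(float));
    cudaMalloc(&dC, (size_t)rows * N * sizeof(float));
    cudaMemcpy(dA, Ablock, (size_t)rows * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, (size_t)N * N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threads(16, 16);
    dim3 blocks((N + 15) / 16, (rows + 15) / 16);
    block_gemm<<<blocks, threads>>>(dA, dB, dC, rows, N);
    cudaMemcpy(Cblock, dC, (size_t)rows * N * sizeof(float), cudaMemcpyDeviceToHost);

    /* Reassemble the full result on rank 0. */
    MPI_Gather(Cblock, rows * N, MPI_FLOAT, C, rows * N, MPI_FLOAT,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0] = %f (expected %f)\n", C[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(B); free(Ablock); free(Cblock);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}

In a production setting the naive kernel would be replaced by a tuned GEMM such as cuBLAS or KBLAS (the baselines the paper compares against), and communication would be overlapped with computation; the sketch only shows where the three programming models meet.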
