The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform

The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in evaluating the performance of large High-Performance Computing (HPC) systems. Because of its lower arithmetic intensity and higher memory pressure, HPCG is recognized as more representative of data-center workloads and of workloads with irregular memory access patterns, and its popularity and acceptance are therefore rising within the HPC community. Only a small fraction of the reference version of the HPCG benchmark is parallelized with shared-memory techniques (OpenMP), so in this report we introduce two OpenMP parallelization methods. Given the increasing importance of the Arm architecture in HPC, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to support tuning of the HPCG benchmark within the Arm community.
