Performance Evaluation and Analysis of Linear Algebra Kernels in the Prototype Tianhe-3 Cluster

As supercomputing systems enter the exascale era, power consumption has become a major concern in system design. Among the techniques for reducing power consumption, the ARM architecture is gaining popularity in the HPC community because of its low power footprint and high energy efficiency. As one of China's initiatives for addressing the exascale challenges, the Tianhe-3 supercomputer has adopted a technology roadmap based on the many-core ARM architecture, with the home-built Phytium-2000+ and Matrix-2000+ processors. In this paper, we evaluate several linear algebra kernels, such as matrix-matrix multiplication, matrix-vector multiplication and triangular solve, on both sparse and dense datasets. These kernels are good performance indicators for the prototype Tianhe-3 cluster. We perform a comprehensive analysis using the roofline model to identify directions for performance optimization from both hardware and software perspectives. In addition, we compare the performance of Phytium-2000+ and Matrix-2000+ with that of the widely used KNL processor. We believe this paper provides valuable experience and insights for the HPC community as work in progress towards exascale.
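As a rough illustration of the roofline analysis mentioned above, the following minimal sketch computes the attainable performance of a kernel as the minimum of the machine's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. The function names, the idealized traffic estimates (each operand moved exactly once, double precision) and the machine numbers are assumptions made for illustration only; they are not measurements of Phytium-2000+, Matrix-2000+ or KNL.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s under the basic roofline model.

    intensity is arithmetic intensity in flops per byte of memory traffic.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def gemv_intensity(n, word_bytes=8):
    """Idealized intensity of dense y = A x for an n x n matrix.

    Assumes A and x are read once and y is written once.
    """
    flops = 2 * n * n
    traffic = word_bytes * (n * n + 2 * n)
    return flops / traffic


def gemm_intensity(n, word_bytes=8):
    """Idealized intensity of dense C = A B for n x n matrices.

    Assumes A and B are read once and C is written once (perfect caching).
    """
    flops = 2 * n ** 3
    traffic = word_bytes * 3 * n * n
    return flops / traffic


if __name__ == "__main__":
    # Hypothetical machine: placeholder values, not taken from the paper.
    peak, bandwidth = 500.0, 100.0  # GFLOP/s, GB/s
    for name, ai in [("gemv", gemv_intensity(1000)),
                     ("gemm", gemm_intensity(1200))]:
        bound = roofline_gflops(peak, bandwidth, ai)
        kind = "memory-bound" if bandwidth * ai < peak else "compute-bound"
        print(f"{name}: intensity={ai:.3f} flop/byte, "
              f"bound={bound:.1f} GFLOP/s ({kind})")
```

Under these assumptions, matrix-vector multiplication sits well below the ridge point (peak divided by bandwidth) and is limited by memory bandwidth, whereas large matrix-matrix multiplication is limited by peak compute. This is the kind of distinction a roofline plot makes visible.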
