Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations

Abstract The lack of public information on the micro-architecture of China's SW26010 processor prevents researchers worldwide from improving application performance on the TaihuLight supercomputer. This study aims to illuminate the undocumented aspects of SW26010 in order to provide the information needed for performance optimization and modeling. First, we developed a micro-benchmark suite, swCandle, to evaluate the key micro-architectural features. The benchmark results revealed several unanticipated findings that go beyond the publicly available data. For instance, the broadcast mode of register communication has the same latency as the peer-to-peer mode. Second, we applied the roofline model, with its key parameters obtained from swCandle, to identify the key programming challenge of SW26010. Third, based on the micro-benchmark results and the roofline model analysis, we proposed a systematic guideline for performance optimization on SW26010 and demonstrated the guideline with two case studies. The methodology developed in this study, which infers a processor's micro-architectural design from micro-benchmark results, can also be applied to other processors that lack public documentation.
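As a rough illustration of the roofline analysis mentioned in the abstract, the sketch below computes the attainable performance of a kernel as the minimum of the compute roof and the memory roof (bandwidth times arithmetic intensity). The peak and bandwidth values are hypothetical placeholders, not swCandle measurements, and the function names are chosen here for illustration only.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof.

    intensity is arithmetic intensity in flops per byte moved from memory.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical placeholder parameters (assumptions, NOT measured values).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 50.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"ridge point: {ridge:.1f} flops/byte")
    for ai in (0.5, 4.0, 20.0, 64.0):
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
        regime = "memory-bound" if ai < ridge else "compute-bound"
        print(f"intensity {ai:5.1f} -> {bound:7.1f} GFLOP/s ({regime})")
```

In a real analysis, the two parameters would be replaced with the measured peak floating-point rate and sustained memory bandwidth of the target processor. Kernels whose intensity falls below the ridge point are limited by memory bandwidth rather than by compute throughput.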
