Power/Performance/Area Evaluations for Next-Generation HPC Processors using the A64FX Chip

Future HPC systems, including post-exascale supercomputers, will face severe challenges such as the slowdown of Moore's law and limits on power supply. Hardware design optimization is a key factor in achieving the desired system performance improvements despite these constraints. In this paper, we investigate future directions for SIMD-based processor architectures using the A64FX chip and customized versions of power/performance/area simulators, namely gem5 and McPAT. Specifically, we first customize various energy parameters in the simulators based on the A64FX chip, and then evaluate the power and area reductions obtained by scaling the technology node down to 3nm. We also investigate the FLOPS improvement achievable at 3nm by scaling the number of cores, the SIMD width, and the FP pipeline width under power/area constraints. The evaluation results indicate that further SIMD/pipeline width scaling does not improve FLOPS because of memory system bottlenecks, especially in the L1 data caches and FP register files. Based on these observations, we discuss future directions for SIMD-based HPC processors.
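
As a rough illustration of the three scaling knobs named above (core count, SIMD width, FP pipeline count), the theoretical peak FLOPS of a SIMD processor can be sketched as a simple product. This is not the paper's simulation flow, which relies on gem5 and McPAT. The function name and the baseline values (48 compute cores, 2.0 GHz, two 512-bit FMA pipelines per core) are assumptions based on commonly cited A64FX figures and do not appear in the abstract.

```python
def peak_gflops(cores, freq_ghz, fp_pipelines, simd_bits,
                element_bits=64, flops_per_lane=2):
    """Theoretical peak GFLOPS (hypothetical helper, not from the paper).

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle.
    """
    lanes = simd_bits // element_bits  # SIMD lanes per pipeline
    return cores * freq_ghz * fp_pipelines * lanes * flops_per_lane


# Assumed A64FX-like baseline: 48 cores, 2.0 GHz, 2 x 512-bit FMA pipelines.
baseline = peak_gflops(cores=48, freq_ghz=2.0, fp_pipelines=2, simd_bits=512)

# Doubling the SIMD width doubles the theoretical peak only.
wider = peak_gflops(cores=48, freq_ghz=2.0, fp_pipelines=2, simd_bits=1024)

print(f"baseline peak: {baseline:.0f} GFLOPS")
print(f"1024-bit SIMD peak: {wider:.0f} GFLOPS")
```

This product is only an upper bound. The abstract's finding is that the achieved FLOPS stops following it once the L1 data caches and FP register files become the bottleneck, which a peak formula like this cannot capture.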
