Performance and power consumption analysis of Arm Scalable Vector Extension

Modern CPUs not only have multiple cores but also support wide single-instruction, multiple-data (SIMD) execution, and this trend is expected to continue. In this paper, we use the Arm Scalable Vector Extension to examine how the vector length and the number of out-of-order resources affect the performance and power consumption of programs with multiple vector lengths. Based on our evaluation, we conclude that a longer vector length with multicycle vector units yields up to approximately 30% higher performance and 21% lower power consumption compared with a shorter vector length.
