A combined fast/cycle accurate simulation tool for reconfigurable accelerator evaluation: application to distributed data management

Parallel computing systems based on reconfigurable accelerators are becoming (1) increasingly heterogeneous, (2) difficult to design and (3) complex to model. Modeling such a system helps to evaluate its performance and to improve its architecture before prototyping. This paper presents a simulation tool for studying the integration of reconfigurable accelerators into scalable distributed systems and runtimes, such as S-DSM systems, where S-DSM (software-distributed shared memory) is a paradigm that eases data management among distributed nodes. The tool simulates the execution of irregular compute kernels accessing distributed data. To deal with the complexity of modeling (3) the complete system, we used a hybrid methodology. We integrated the simulation engine into the S-DSM. The distributed data management part is executed on the physical architecture, which yields precise and faithful latencies, while the accelerator simulation is cycle accurate. We used general sparse matrix-matrix multiplication (SpGEMM) as a case study. We show that this tool makes it possible to analyze the behavior of a heterogeneous system (1) with rapid prototyping and simulation. Analysis of the results allowed us to determine the correct sizing of the architecture (2) to obtain the best performance. The tool also allowed us to identify the bottleneck of our architecture and confirmed the possibility of hiding data access latencies. Our simulation platform emulates a heterogeneous distributed system while introducing a slowdown of between 1.2 and 3.7 times compared to the compute kernel simulation alone.
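
To illustrate why SpGEMM counts as an irregular compute kernel, the following is a minimal Python sketch of a row-wise (Gustavson-style) sparse matrix-matrix product. It is not the accelerator kernel evaluated in the paper: the function name `spgemm_rowwise` and the dictionary-of-rows storage format are assumptions chosen for readability. The point it shows is that the rows of B that are read depend on the nonzero column indices of A, so the data access pattern is only known at run time.

```python
def spgemm_rowwise(a_rows, b_rows):
    """Compute C = A * B for sparse matrices stored as {row: {col: value}}.

    Hypothetical helper for illustration only; the storage format is an
    assumption, not the one used by the simulated accelerator.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            # Which row of B is fetched depends on the nonzero pattern of A:
            # this data-dependent access is what makes the kernel irregular.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            c_rows[i] = acc
    return c_rows


# Small usage example with two 3-column sparse matrices.
A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {1: 6}}
C = spgemm_rowwise(A, B)
print(C)
```

In a distributed setting, each fetch of a row of B may target data held by a remote node, which is the kind of access latency the abstract refers to.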
