An approach to build cycle accurate full system VLIW simulation platform

Abstract

Very long instruction word (VLIW) architecture is widely used in the design of digital signal processors (DSPs) and application-specific processors because of its hardware simplicity and high efficiency. Some heterogeneous systems also use VLIW-style accelerators to achieve high computing performance and power efficiency. However, few widely accepted simulators can model a VLIW architecture cycle-accurately or simulate an entire heterogeneous system with VLIW accelerators. In this paper we present an approach to building a cycle-accurate full-system VLIW simulation platform. The basic idea is to analyze the Petri net modeling used in traditional cycle-accurate simulation and adapt it to the VLIW architecture. The adjustments reconstruct and optimize the colored tokens, places, and arcs of the Petri net so that they reflect VLIW characteristics. Following this approach, and based on the InOrder simulator in the open-source simulation framework Gem5, we build a heterogeneous multicore full-system simulator for the MaPU (Mathematical Processor Unit) chip. It is composed of a functionally accurate ARM simulator and a cycle-accurate accelerator simulator. To evaluate the performance and accuracy of our simulator, Gem5-MaPU, we compare the results of a set of DSP benchmarks executed on both the simulator and the RTL model. The results show that our simulator is about 1000 times faster than the RTL model while keeping the cycle error below 5%. With this high accuracy and large speedup over RTL simulation, the cycle-accurate simulator proves to be an efficient and flexible tool for studying and developing VLIW-related architectures, for example in hardware-software co-design and performance evaluation.
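
To make the idea of adapting a Petri net to a VLIW pipeline more concrete, the following is a minimal sketch of a simplified timed, colored Petri net. It is not the Gem5-MaPU implementation. It assumes a four-stage in-order pipeline (IF, ID, EX, WB), stages that each hold one packet, and a per-packet EX latency. The names `Bundle`, `Place`, `simulate`, `ex_latency` and the stage names are all hypothetical illustrations. The key point is that each token's "color" is a whole VLIW packet, so all of its slots advance through the places together. Arcs between adjacent places fire only when the token is ready and the next place has room.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    """Colored token: one whole VLIW packet. Its slots travel together, so a
    single token (not one token per operation) represents the packet."""
    pc: int
    slots: List[str]          # operations issued in parallel (the token's "color")
    ex_latency: int = 1       # hypothetical: cycles the packet occupies the EX place
    ready: int = 0            # earliest cycle at which the token may leave its place


@dataclass
class Place:
    """Pipeline-stage buffer; capacity bounds how many packets the stage can hold."""
    name: str
    capacity: int = 1
    tokens: List[Bundle] = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.tokens) < self.capacity


def simulate(program: List[Bundle], stages: List[str]) -> int:
    """Return the number of cycles until every packet has left the last stage."""
    places = [Place(name) for name in stages]
    pending = list(program)   # packets not yet fetched
    retired = 0
    now = 0
    while True:
        # Sink transition: at most one ready packet leaves the last stage per cycle.
        last = places[-1]
        if last.tokens and last.tokens[0].ready <= now:
            last.tokens.pop(0)
            retired += 1
        if retired == len(program):
            return now
        # Stage-to-stage transitions, fired from the back of the pipeline so a
        # packet that moves on frees its place for the predecessor this cycle.
        for i in range(len(places) - 2, -1, -1):
            src, dst = places[i], places[i + 1]
            if src.tokens and src.tokens[0].ready <= now and dst.has_room():
                tok = src.tokens.pop(0)
                tok.ready = now + (tok.ex_latency if dst.name == "EX" else 1)
                dst.tokens.append(tok)
        # Source transition: fetch the next packet if the first stage has room.
        if pending and places[0].has_room():
            tok = pending.pop(0)
            tok.ready = now + 1
            places[0].tokens.append(tok)
        now += 1
```

For three single-cycle packets on the four stages, `simulate` returns 6 (4 + 3 - 1). Giving the first packet a 2-cycle EX latency blocks the EX place for one extra cycle, stalls the packets behind it through the capacity check on each arc, and the result becomes 7.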
