XMTSim: A Simulator of the XMT Many-core Architecture

This paper documents the features and the design of XMTSim, the cycle-accurate simulator of the Explicit Multi-Threading (XMT) computer architecture. XMT is a general-purpose many-core computing platform, with the vision of a 1000-core chip that is easy to program but does not compromise on performance. XMTSim is a primary component of the publicly available XMT toolchain, along with an optimizing compiler. Research and experimentation enabled by the toolchain played a central role in supporting the ease-of-programming and performance aspects of the XMT architecture. The compiler and the simulator are also important milestones toward an efficient programmer's workflow from PRAM algorithms to programs that run on the shared-memory XMT hardware. This workflow is a key component in accomplishing the dual goal of ease of programming and performance. The applicability of the XMT simulator extends beyond specific XMT choices: system researchers and programmers can use it to explore the much greater design space of shared-memory many-cores. Because the toolchain can run on practically any computer, it provides a supportive environment for teaching parallel algorithmic thinking with a programming component.

XMTSim is the highly configurable cycle-accurate simulator of the XMT computer architecture [38, 39, 48, 49]. It is tuned to approximate the behavior of the major on-die components of XMT, such as the cores, the interconnect and the on-chip caches. Additionally, XMTSim features a power model and a thermal model, and it provides means to simulate dynamic power and thermal management algorithms. We made XMTSim publicly available as a part of the XMT programming toolchain [7, 9], which also includes an optimizing compiler [45]. Detailed information on the XMT architecture and the programming model can be found in [30].
XMT envisions bringing efficient on-chip parallel programming to the mainstream. The toolchain is instrumental in obtaining results that validate this vision, and in making a simulated XMT platform accessible from any personal computer. XMTSim is useful to a range of communities, such as system architects, teachers of parallel programming and algorithm developers, for the following four reasons:

1. Opportunity to evaluate alternative system components. XMTSim allows users to change the parameters of the simulated architecture, including the number of functional units and the organization of the parallel cores. It is also easy to add new functionality to the simulator, which makes it a suitable platform for evaluating both architectural extensions and algorithmic improvements that depend on the availability of hardware resources. For example, Caragea et al. [8] search for the optimal size and replacement policy for prefetch buffers given limited transistor resources. Furthermore, to our knowledge, XMTSim is the only publicly available many-core simulator that allows evaluation of architectural mechanisms/features such as dynamic power and thermal management. Finally, the capabilities of our toolchain extend beyond specific XMT choices: system architects can use it to explore a much greater design space of shared-memory many-cores.

2. Performance advantages of XMT and PRAM algorithms. A number of publications [6, 10, 15, 16, 17, 18, 19, 41] list the performance advantages of XMT compared to existing parallel architectures, and also document the interest of the academic community in such results. XMTSim was the enabling factor for the publications that investigate planned or future configurations. Moreover, despite past doubts about the practical relevance of PRAM algorithms, results facilitated by the toolchain showed not only that theory-based algorithms can provide good speedups in practice, but also that they are sometimes the only ones to do so.

3. Teaching and experimenting with on-chip parallel programming. As a part of the XMT toolchain, XMTSim contributed to the experiments that established the ease of programming of XMT. These experiments were presented in publications [23, 40, 44, 46] and conducted in courses taught to graduate, undergraduate, high-school and middle-school students, including at Thomas Jefferson High School, Alexandria, VA. The curriculum at Thomas Jefferson High School has featured XMT programming since 2008; more than two hundred of its students have already programmed XMT, and in 2012 forty of these high-school students demonstrated Ph.D.-level parallel programming [14]. In addition, the XMT toolchain provides a convenient platform for teaching parallel algorithms and programming, because students can install and use it on any personal computer to work on their assignments.
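
To make the first point above concrete, the following is a minimal sketch of the kind of parameter sweep that a configurable cycle-level simulator enables. It is a toy model written for illustration only and does not use XMTSim's code or interface: the names ToyConfig, simulate and sweep_units, and the assumption that every operation occupies one functional unit for a fixed number of cycles, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical parameters; these are not XMTSim's configuration options."""
    num_cores: int
    units_per_core: int  # functional units available in each core
    op_latency: int      # cycles that one operation occupies a unit


def simulate(config: ToyConfig, num_ops: int) -> int:
    """Step a global clock one cycle at a time until all operations finish.

    Returns the number of cycles needed to complete num_ops independent
    operations under the toy assumptions stated above.
    """
    total_units = config.num_cores * config.units_per_core
    busy_until = [0] * total_units  # cycle at which each unit becomes free
    remaining = num_ops
    cycle = 0
    while remaining > 0 or any(t > cycle for t in busy_until):
        for i in range(total_units):
            # Issue one pending operation to every unit that is free now.
            if remaining > 0 and busy_until[i] <= cycle:
                busy_until[i] = cycle + config.op_latency
                remaining -= 1
        cycle += 1
    return cycle


def sweep_units(num_cores, unit_counts, op_latency, num_ops):
    """Run the toy simulation once per candidate units-per-core value."""
    return {
        units: simulate(ToyConfig(num_cores, units, op_latency), num_ops)
        for units in unit_counts
    }


if __name__ == "__main__":
    for units, cycles in sweep_units(4, [1, 2, 4], 3, 64).items():
        print(f"units_per_core={units}: {cycles} cycles")
```

In this toy model the cycle count falls as functional units are added, which is the sort of trade-off a design-space study would weigh against area or power budgets. A real cycle-accurate simulator additionally models the interconnect, caches and memory, which this sketch omits.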

[1]  Uzi Vishkin,et al.  Brief announcement: speedups for parallel graph triconnectivity , 2012, SPAA '12.

[2]  Laurie J. Hendren,et al.  SableCC, an object-oriented compiler framework , 1998, Proceedings. Technology of Object-Oriented Languages. TOOLS 26 (Cat. No.98EX176).

[3]  Norman P. Jouppi,et al.  CACTI: an enhanced cache access and cycle time model , 1996, IEEE J. Solid State Circuits.

[4]  Uzi Vishkin,et al.  Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach , 2003, Theory of Computing Systems.

[5]  A. B. Saybasili.  Highly parallel multi-dimensional fast Fourier transform on fine- and coarse-grained many-core approaches , 2022 .

[6]  Uzi Vishkin,et al.  Truly parallel burrows-wheeler compression and decompression , 2013, SPAA.

[7]  Pascal Vivet,et al.  Power Modeling in SystemC at Transaction Level, Application to a DVFS Architecture , 2008, 2008 IEEE Computer Society Annual Symposium on VLSI.

[8]  Kevin Skadron,et al.  Compact thermal modeling for temperature-aware design , 2004, Proceedings. 41st Design Automation Conference, 2004.

[9]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[10]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[11]  Kevin Skadron,et al.  Many-core design from a thermal perspective , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[12]  Margaret Martonosi,et al.  Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data , 2003, MICRO.

[13]  J. Banks,et al.  Discrete-Event System Simulation , 1995 .

[14]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[15]  Aamer Jaleel,et al.  DRAMsim: a memory system simulator , 2005, CARN.

[16]  George C. Caragea,et al.  Brief announcement: performance potential of an easy-to-program PRAM-on-chip prototype versus state-of-the-art processor , 2009, SPAA '09.

[17]  Uzi Vishkin,et al.  Is teaching parallel algorithmic thinking to high school students possible?: one teacher's experience , 2010, SIGCSE.

[18]  Uzi Vishkin,et al.  A pilot study to compare programming effort for two parallel programming models , 2007, J. Syst. Softw..

[19]  Hyunjin Lee,et al.  TPTS: A Novel Framework for Very Fast Manycore Processor Architecture Simulation , 2008, 2008 37th International Conference on Parallel Processing.

[20]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[21]  Greg Hamerly,et al.  SimPoint 3.0: Faster and More Flexible Program Analysis , 2005 .

[22]  George C. Caragea,et al.  Brief announcement: better speedups for parallel max-flow , 2011, SPAA '11.

[23]  Brad Calder,et al.  SimPoint 3.0: Faster and More Flexible Program Phase Analysis , 2005, J. Instr. Level Parallelism.

[24]  Kevin Skadron,et al.  An Improved Block-Based Thermal Model in HotSpot 4.0 with Granularity Considerations , 2007 .

[25]  Uzi Vishkin,et al.  Better speedups using simpler parallel programming for graph connectivity and biconnectivity , 2012, PMAM '12.

[26]  Sheng Liang,et al.  Java Native Interface: Programmer's Guide and Reference , 1999 .

[27]  Uzi Vishkin,et al.  A Low-Overhead Asynchronous Interconnection Network for GALS Chip Multiprocessors , 2010, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[28]  Fuat Keceli,et al.  Resource-Aware Compiler Prefetching for Many-Cores , 2010, 2010 Ninth International Symposium on Parallel and Distributed Computing.

[29]  David Parello,et al.  Barra, a Modular Functional GPU Simulator for GPGPU , 2009 .

[30]  Fuat Keceli,et al.  Thermal Management of a Many-Core Processor under Fine-Grained Parallelism , 2011, Euro-Par Workshops.

[31]  Li Zhao,et al.  Exploring Large-Scale CMP Architectures Using ManySim , 2007, IEEE Micro.

[32]  Uzi Vishkin,et al.  PRAM-on-chip: first commitment to silicon , 2007, SPAA '07.

[33]  R. M. Fujimoto,et al.  Parallel discrete event simulation , 1989, WSC '89.

[34]  Fuat Keceli,et al.  Power-Performance Comparison of Single-Task Driven Many-Cores , 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[35]  Uzi Vishkin,et al.  Empirical Speedup Study of Truly Parallel Data Compression , 2013 .

[36]  Alexandros Tzannes,et al.  The compiler for the XMTC parallel language: Lessons for compiler developers and in-depth description , 2011 .

[37]  Andrew B. Kahng,et al.  ORION 2.0: A Power-Area Simulator for Interconnection Networks , 2012, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[38]  George C. Caragea,et al.  General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Workloads , 2010 .

[39]  Luca Benini,et al.  HW-SW emulation framework for temperature-aware design in MPSoCs , 2008, TODE.

[40]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[41]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[42]  Fuat Keceli,et al.  Power and Performance studies of the Explicit Multi-Threading (XMT) Architecture , 2011 .

[43]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[44]  Zheng Li,et al.  A Very Fast Simulator for Exploring the Many-Core Future , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[45]  Uzi Vishkin,et al.  Evaluating the XMT parallel programming model , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[46]  Uzi Vishkin,et al.  FPGA-based prototype of a PRAM-on-chip processor , 2008, CF '08.