Toolchain for Programming, Simulating and Studying the XMT Many-Core Architecture

Explicit Multi-Threading (XMT) is a general-purpose many-core computing platform, with the vision of a 1000-core chip that is easy to program but does not compromise on performance. This paper presents a publicly available toolchain for XMT, complete with a highly configurable cycle-accurate simulator and an optimizing compiler. The XMT toolchain has matured and been validated to the point where its description merits publication. In particular, research and experimentation enabled by the toolchain played a central role in supporting the ease-of-programming and performance aspects of the XMT architecture. The compiler and the simulator are also important milestones for an efficient programmer's workflow from PRAM algorithms to programs that run on the shared-memory XMT hardware; this workflow is a key component in accomplishing the dual goal of ease of programming and performance. The applicability of our toolchain extends beyond specific XMT design choices: system researchers and programmers can use it to explore the much broader design space of shared-memory many-cores. Because the toolchain runs on practically any computer, it also provides a supportive environment for teaching parallel algorithmic thinking with a programming component. Unencumbered by techniques such as decomposition-first programming and programming for locality, this environment makes it possible, when desired, to defer teaching those techniques to more advanced or platform-specific courses.
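To make the PRAM-to-program workflow concrete, the sketch below shows the kind of XMTC code the toolchain compiles and simulates: a parallel array-compaction step written in the spawn/join style of the XMTC language (C extended with a spawn block, a thread-index symbol $, and a prefix-sum primitive ps). The syntax follows published XMTC conventions, but the array names, the problem size N, and the surrounding main() are illustrative assumptions rather than code taken from the paper.

    /* Illustrative XMTC sketch: compact the non-zero entries of A into B.
       spawn(low, high) creates one virtual thread per index in [low, high];
       $ is the index of the current virtual thread; ps(inc, base) atomically
       adds inc to base and returns the previous value of base in inc. */

    #define N 1024

    int A[N];          /* input: sparse array                          */
    int B[N];          /* output: non-zero entries packed contiguously */
    psBaseReg count;   /* base register used by the ps() primitive     */

    int main(void) {
        count = 0;
        spawn(0, N - 1) {
            int inc = 1;            /* each thread tries to claim one slot */
            if (A[$] != 0) {
                ps(inc, count);     /* inc now holds a unique slot index   */
                B[inc] = A[$];
            }
        }                           /* implicit join: all threads finish   */
        /* count now holds the number of non-zero entries copied into B */
        return 0;
    }

Under the toolchain described here, such a program would be compiled by the XMTC compiler and its cycle counts measured on the configurable simulator, which is what enables the ease-of-programming and performance studies mentioned above.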
