Toward a Software Infrastructure for the Cyclops-64 Cellular Architecture

This paper presents the initial design of the Cyclops-64 (C64) system software infrastructure and tools, under development as a joint effort between IBM T.J. Watson Research Center, ETI Inc. and the University of Delaware. The C64 system is the latest version of the Cyclops cellular architecture, which consists of a large number of compute nodes, each of which employs a multiprocessor-on-a-chip architecture with 160 hardware thread units. The first version of the C64 system software has been developed and is now under evaluation. The current version of the C64 software infrastructure includes a C64 toolchain (compiler, linker, functionally accurate simulator, runtime thread library, etc.) and other tools for system control (system initialization, diagnostics and recovery, job scheduler, program launching, etc.). This paper focuses on the following aspects of the C64 system software: (1) the C64 software toolchain; (2) the C64 Thread Virtual Machine (C64 TVM), with emphasis on TiNy Threads™, the implementation of the C64 TVM; and (3) the system software for host control. In addition, we illustrate through two case studies what an application developer can expect from the C64 architecture, as well as some advantages of this architecture, in particular how it provides a cost-effective solution. Depending on the application, a C64 chip performs 5 to 35 times faster than common off-the-shelf microprocessors.
