Reduced Interprocessor-Communication Architecture and its Implementation on EM-4

Abstract: One of the most significant issues in building general-purpose massively parallel computers is the integration of computation and communication in an efficient and cost-effective manner. This paper presents a way of integrating computation and communication from the viewpoint of the processor architecture. Two claims are presented and examined: (1) computation and communication should be tightly coupled and their operation should be highly overlapped; and (2) the communication structure should be efficient and simple, i.e. the turnaround from data input to execution should be as short as possible, using a simplified message-handling mechanism. Briefly: ‘Fuse communication and computation, then make the fused structure as simple and efficient as possible’. In this paper the fused structure is called RICA, Reduced Interprocessor-Communication Architecture. The word ‘Reduced’ here refers to the simplified structure of message handling, invocation of a new thread, computation and message generation. Based on the RICA design principles, the authors have developed the EM-series parallel computers: EM-4, EM-X and EM-5. This paper concentrates on the architecture and implementation of EM-4, whose first prototype has been fully operational since April 1990. The communication performance of EM-4 is comparable with its computation performance. Computation and communication are fused efficiently and simply within the EM-4 architecture as a single pipeline that performs message handling, instruction execution and packet output; message-handling time is two RISC clocks, independent of the processor instructions being executed. This pipeline naturally includes a sequential RISC pipeline for executing local operations. Second, this paper evaluates RICA by implementing on EM-4 several programming models generally considered ‘effective’ or ‘promising’. The multi-threaded, message-passing and data-parallel models have been implemented, and a shared-memory model is being implemented on EM-4. The performance of the communication primitives for each model is measured on the EM-4 prototype and reported here.
