Super-threading: architectural and software mechanisms for optimizing parallel computation

This paper presents super-threading, a general term for the architectural and software mechanisms that optimize parallel computation. Super-threading covers architectural optimization of the processing element (PE), mechanisms that support fast communication and computation, and compiler and runtime-system techniques for optimizing thread creation, thread allocation, granularity tuning, and data allocation to physically distributed storage. This paper defines super-threading and examines some of the technologies that belong to it. A processor architecture based on super-threading is proposed, and its implementation on the highly parallel computer EM-4 is presented together with performance data. Software issues in super-threading are also examined, mainly from the viewpoint of granularity optimization: dynamic granularity optimization methods are proposed and evaluated on EM-4. The performance data indicate that super-threading is a key technology for realizing efficient massively parallel computers.
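
The abstract does not describe how the EM-4 granularity methods work. As a rough illustration of the general idea behind dynamic granularity optimization, the Python sketch below makes the spawn-or-inline decision at run time: a subcomputation gets its own thread only if it is large enough and the number of active threads is below a limit; otherwise it runs sequentially in the caller. This is a generic load-based throttling scheme, not the method proposed in the paper, and the names GranularityController, MAX_ACTIVE_THREADS, SEQ_CUTOFF and the fib workload are hypothetical.

```python
import threading

# Hypothetical tuning parameters chosen for illustration; not from the paper.
MAX_ACTIVE_THREADS = 8  # spawn only while fewer than this many extra threads run
SEQ_CUTOFF = 12         # subproblems smaller than this always run inline


class GranularityController:
    """Run-time gate that decides whether a subcomputation gets its own thread."""

    def __init__(self, max_active: int) -> None:
        self.max_active = max_active
        self.active = 0    # spawned threads currently running
        self.spawned = 0   # total threads created so far
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Non-blocking: if the limit is reached, the caller inlines the work.
        with self._lock:
            if self.active < self.max_active:
                self.active += 1
                self.spawned += 1
                return True
            return False

    def release(self) -> None:
        with self._lock:
            self.active -= 1


def fib(n: int, ctl: GranularityController) -> int:
    if n < 2:
        return n
    if n < SEQ_CUTOFF or not ctl.try_acquire():
        # Too fine-grained, or no free thread slot: execute sequentially.
        return fib(n - 1, ctl) + fib(n - 2, ctl)

    # Coarse enough and a slot is free: evaluate one branch in a new thread.
    result = {}

    def worker() -> None:
        try:
            result["left"] = fib(n - 1, ctl)
        finally:
            ctl.release()

    t = threading.Thread(target=worker)
    t.start()
    right = fib(n - 2, ctl)  # the parent thread keeps working meanwhile
    t.join()
    return result["left"] + right


if __name__ == "__main__":
    ctl = GranularityController(MAX_ACTIVE_THREADS)
    print(fib(20, ctl))  # 6765
    print("threads spawned:", ctl.spawned)
```

Because of CPython's global interpreter lock, this sketch demonstrates the control logic rather than a real speedup. The design point is that the grain size is not fixed at compile time: the same call either becomes a thread or is inlined depending on the current load.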