The Illinois Aggressive COMA Multiprocessor project (I-ACOMA)

While scalable shared-memory multiprocessors with hardware-assisted cache coherence are relatively easy to program, they still require substantial programmer effort if truly high performance is desired. For example, data must be allocated close to the processors that will use them, and the application must be tuned so that the working set fits in the caches. This is unfortunate, because the most important obstacle to the widespread use of parallel computing is the difficulty of programming parallel machines. The goal of the I-ACOMA project is to explore how to design a highly programmable, high-performance multiprocessor. The authors focus on a flat-COMA scalable multiprocessor supported by a parallelizing compiler. The main issues under study are advanced processor organizations, techniques to handle long memory access latencies, and support for important classes of workloads such as databases and scientific applications with loops that cannot be analyzed by the compiler. The project also involves building a prototype that includes some of the features discussed.
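As a concrete illustration of the last class of workloads, the minimal C sketch below (not taken from the paper; the function and variable names are hypothetical) shows an irregular loop whose cross-iteration dependences depend on the run-time contents of an index array. A compiler cannot prove such a loop parallel at compile time, which is why run-time or speculative parallelization techniques are needed.

```c
/* Hypothetical sketch of an irregular loop the compiler cannot analyze.
 * Whether iterations are independent depends on the run-time values in
 * idx[]: if idx[] contains no repeated entries, all iterations can run
 * in parallel; otherwise some updates conflict. */
void scatter_update(double *a, const int *idx, const double *v, int n)
{
    for (int i = 0; i < n; i++) {
        a[idx[i]] += v[i];   /* cross-iteration dependence only if idx repeats */
    }
}
```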
