Parallel Communication Mechanisms for Sparse, Irregular Applications

Parallel systems are becoming a significant computing technology, not only for high-performance computing but also for commodity servers. The goal of this research is to identify core communication mechanisms that both exploit architectural trends and support real applications. We demonstrate that cache-coherent shared-memory hardware is such a core mechanism, even for applications with little data reuse and data-driven synchronization. This thesis makes three major contributions. First, we perform an in-depth study of the interaction between communication mechanisms and sparse, irregular applications. Second, we present the Remote Queues (RQ) communication model, an abstraction which synthesizes more efficient synchronization for hardware-supported shared memory and other complex systems. Third, we characterize the relative performance of all of our mechanisms as processor speed and machine size scale.

On the MIT Alewife Multiprocessor, we find that shared memory provides high performance with lower code complexity than message passing on our irregular problems, for four primary reasons. First, a 5-to-1 cost ratio between global and local cache misses makes the memory copies required by bulk communication expensive relative to communication through shared memory. Second, although message passing has synchronization semantics superior to those of shared memory for data-driven computation, efficient shared memory can overcome this handicap by using global read-modify-writes to shift from the traditional owner-computes model to a producer-computes model. Third, the Remote Queues communication model generalizes this shift, providing the semantics and performance of polling active messages on a wide variety of systems. Fourth, bulk transfers can leave processors idle for long periods in irregular applications. Finally, we characterize multiprocessor design points at which message passing and bulk transfer can outperform shared memory.
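The shift from owner-computes to producer-computes can be sketched as follows. This is an illustrative Python simulation, not code from the thesis: a lock-protected update stands in for the hardware's global read-modify-write, and the sparse accumulation into `y` is a hypothetical workload.

```python
import threading

# Owner-computes: each processor may update only the y[i] it owns, so
# remotely produced contributions must first be communicated to the owner.
# Producer-computes: the processor that computes a contribution applies it
# directly with a global read-modify-write. A per-element lock stands in
# for the hardware's atomic fetch-and-add here (an assumption of this
# sketch, not the machine's actual primitive).

N = 8
y = [0.0] * N                                   # shared result vector
locks = [threading.Lock() for _ in range(N)]

def produce(contribs):
    # Apply each (row, value) contribution where it is produced,
    # regardless of which processor "owns" the row.
    for i, v in contribs:
        with locks[i]:                          # atomic read-modify-write
            y[i] += v

# Two "processors", each producing contributions to rows it does not own.
t1 = threading.Thread(target=produce, args=([(0, 1.0), (3, 2.0)],))
t2 = threading.Thread(target=produce, args=([(0, 0.5), (5, 4.0)],))
t1.start(); t2.start(); t1.join(); t2.join()
```

Because updates are applied at the producer, no processor waits for contributions to arrive before computing, which is the source of the data-driven synchronization advantage described above.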
In particular, we find that shared memory uses more than four times as much network bandwidth as message passing. Unless an application’s performance is already limited by local memory speeds, network bandwidth and latency threaten to become serious bottlenecks. Our study indicates that machines built from modern microprocessors, such as the Cray T3E, must resort to expensive, high-dimensional networks to support shared-memory traffic. Furthermore, the round-trip nature of shared-memory communication may be unable to tolerate the latencies of future networks.
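A toy overhead model suggests why shared memory consumes more bandwidth on irregular codes. All constants below are illustrative assumptions, not the thesis's measured parameters: shared memory pays per-cache-line packet headers and fetches whole lines even when little of each line is used, while bulk message passing amortizes one header over many packed words.

```python
# Toy bandwidth model. Every constant here is an assumption for
# illustration only, not a measurement from the thesis.
HEADER = 16   # bytes of header per network packet (assumed)
LINE = 64     # bytes per cache line (assumed)
USEFUL = 8    # bytes actually used per fetched line when spatial
              # locality is poor (assumed)

# Shared memory: each remote miss is a request packet plus a response
# packet carrying one cache line.
sm_bytes = HEADER + (HEADER + LINE)             # bytes per line fetched

# Bulk message passing: one header amortized over k lines of payload.
def bulk_bytes_per_line(k):
    return (HEADER + k * LINE) / k

# If every fetched byte is useful, shared memory's overhead is modest:
line_ratio = sm_bytes / bulk_bytes_per_line(64)

# If only USEFUL bytes per line are needed and message passing packs
# just those words, the gap widens sharply:
def bulk_bytes_per_word(k):
    return (HEADER + k * USEFUL) / k

word_ratio = sm_bytes / bulk_bytes_per_word(64)
```

Under these assumed constants the ratio is roughly 1.5x when fetched lines are fully used and over 10x when they are not; a measured four-to-one ratio would fall between the two extremes, consistent with partial spatial locality in irregular applications.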
