Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation

While the desire to use commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of parallel applications. We adapt the LogP model to characterize the communication architectures of DSM machines. The l (network latency) and o (controller occupancy) parameters are the keys to performance in these machines, with the g (node-to-network bandwidth) parameter becoming important only for the fastest controllers. We show that, of all the LogP parameters, controller occupancy has the greatest impact on application performance. Occupancy degrades performance in two ways, through the latency it adds and the contention it induces; it is the contention component that governs performance regardless of network latency, showing a quadratic dependence on o. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. Surprisingly, the performance impact of occupancy is substantial, even for highly tuned applications and even in the absence of latency-hiding techniques. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. Through experiments on a DSM prototype, we show that there are important classes of applications for which the performance lost by using higher-occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
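The split between the latency and contention components of occupancy can be illustrated with a small queueing sketch. This is not the paper's analytical model. It assumes, purely for illustration, that each controller behaves as an M/D/1 queue (Poisson arrivals, fixed service time o), and that a remote miss crosses the network twice and visits two controllers. The function name `remote_miss_time` and the parameters `rate`, `hops`, and `visits` are hypothetical.

```python
def remote_miss_time(l, o, rate, hops=2, visits=2):
    """Return (latency_part, contention_part) of one remote miss.

    Illustrative assumptions, not taken from the paper:
      l      -- one-way network latency (cycles)
      o      -- controller occupancy per request (cycles)
      rate   -- request arrival rate at each controller (requests/cycle)
      hops   -- network traversals per miss (request + reply)
      visits -- controllers visited per miss (e.g. local and home)
    """
    utilization = rate * o
    if not 0.0 <= utilization < 1.0:
        raise ValueError("controller is saturated: rate * o must be < 1")

    # Uncontended cost: occupancy adds linearly to the miss latency.
    latency_part = hops * l + visits * o

    # M/D/1 mean waiting time (Pollaczek-Khinchine with fixed service
    # time): rate * o^2 / (2 * (1 - rate * o)). The o^2 term is what
    # makes queueing delay grow quadratically in occupancy at a fixed
    # request rate.
    queue_wait = rate * o * o / (2.0 * (1.0 - utilization))
    contention_part = visits * queue_wait

    return latency_part, contention_part


if __name__ == "__main__":
    for o in (10.0, 20.0, 40.0):
        lat, con = remote_miss_time(l=100.0, o=o, rate=0.01)
        print(f"o={o:5.1f}  latency part={lat:7.2f}  contention part={con:7.2f}")
```

Under these assumptions, doubling o at a fixed request rate doubles occupancy's contribution to the latency part but more than quadruples the contention part, which is one way to read the quadratic dependence on o reported above.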
