The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation

Clustering processors together at a level of the memory hierarchy in shared address space multiprocessors appears to be an attractive technique from several standpoints: resources are shared, packaging technologies are exploited, and processors within a cluster can share data more effectively. We investigate the performance benefits of clustering for a range of important scientific and engineering applications on moderate- to large-scale cache-coherent machines with small degrees of clustering (up to one eighth of the total number of processors per cluster). We find that, except for applications with near-neighbor communication topologies, this degree of clustering is not very effective at reducing the inherent communication-to-computation ratios. Clustering is more useful in reducing the number of remote capacity misses in unstructured applications, and it can improve performance substantially when small first-level caches are clustered in these cases. This suggests that clustering at the first-level cache might be useful in highly integrated, relatively fine-grained environments. For less integrated machines, such as current distributed shared memory multiprocessors, our results suggest that clustering at the first-level caches does little to improve application performance; however, they also suggest that in a machine with long interprocessor communication latencies, clustering further from the processor can provide performance benefits.
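To make the near-neighbor case concrete, the following is a minimal back-of-the-envelope sketch (not from the paper) of why clustering helps such topologies: assuming a block-decomposed 2D grid computation in which each processor exchanges its block boundary with its neighbors, grouping adjacent blocks into a cluster means only traffic crossing the cluster boundary is remote. The function name, parameters, and example machine sizes below are illustrative assumptions, not the paper's experimental setup.

```python
import math

def comm_to_comp_ratio(n, p, cluster_size=1):
    """Estimate the inherent remote communication-to-computation ratio for a
    near-neighbor 2D computation on an n x n grid with p processors.

    Each processor owns an (n/sqrt(p)) x (n/sqrt(p)) block. With clustering,
    cluster_size adjacent blocks share a level of the memory hierarchy, so
    only data crossing the (approximately square) cluster boundary is remote.
    """
    block = n / math.sqrt(p)           # side length of one processor's block
    computation = block * block        # grid points updated per processor

    if cluster_size == 1:
        remote_boundary = 4 * block    # entire block perimeter is remote traffic
    else:
        # Treat the cluster as a roughly square region of cluster_size blocks;
        # its perimeter is shared among the processors inside the cluster.
        cluster_side = block * math.sqrt(cluster_size)
        remote_boundary = 4 * cluster_side / cluster_size

    return remote_boundary / computation

# Example: 512x512 grid on 64 processors, clusters of 8 (one eighth of the machine)
print(comm_to_comp_ratio(512, 64, cluster_size=1))  # unclustered ratio
print(comm_to_comp_ratio(512, 64, cluster_size=8))  # clustered ratio (~2.8x lower)
```

Under these assumptions the remote communication per processor shrinks by roughly a factor of sqrt(cluster_size), which is why near-neighbor applications are the ones that benefit from small degrees of clustering; applications without such spatial locality in their communication do not see a comparable reduction.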
