Paper information - Impact of heterogeneity on DSM performance

This paper explores area/parallelism tradeoffs in the design of distributed shared-memory (DSM) multiprocessors built out of large single-chip computing nodes. In this context, area-efficiency arguments motivate a heterogeneous organization consisting of a few nodes with large caches designed for single-thread parallelism and a larger number of nodes with smaller caches designed for multi-thread parallelism. The quantitative performance of such an organization is reported for a set of homogeneous multiprocessor programs from the SPLASH-2 benchmark suite. These programs are mapped onto the heterogeneous processors without source code modifications via static thread assignment policies. Simulation-based analysis is used to compare the performance of heterogeneous and homogeneous DSMs that occupy the same silicon area. The analysis shows that a 4-node heterogeneous DSM with 21 processors outperforms its homogeneous counterpart with 4 processors by an average of 36% for the studied multiprocessor workload, while having the same performance for sequential codes. A sensitivity analysis based on a factorial design experiment is used to study the implications of processor, memory, and network heterogeneity on the overall cost and performance of a heterogeneous DSM. The studied benchmarks are affected, on average, primarily by heterogeneity in processor performance (59.3%), followed by cache sizes (18.2%), memory latency (14.6%), and network latency (5.6%).
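
The sensitivity analysis in the abstract attributes performance variation to individual factors. As a rough illustration of how a two-level full factorial design can yield such percentages, the Python sketch below computes the allocation of variation for a 2^k design. The function name, the factor labels, and the response values are hypothetical and are not taken from the paper, whose actual experimental design may differ.

```python
from itertools import product


def allocation_of_variation(factors, responses):
    """Fraction of total variation explained by each effect in a 2^k design.

    factors:   list of factor names (hypothetical labels).
    responses: dict mapping a tuple of -1/+1 levels, one per factor,
               to the measured response for that run.
    """
    k = len(factors)
    runs = list(product((-1, 1), repeat=k))
    n = len(runs)  # 2^k runs in a full factorial design

    sum_of_squares = {}
    # Every non-empty subset of factors is a main effect or an interaction.
    for mask in range(1, 2 ** k):
        subset = [i for i in range(k) if (mask >> i) & 1]
        # Effect q = (1 / 2^k) * sum over runs of (sign * response),
        # where sign is the product of the levels of the factors in subset.
        q = 0.0
        for run in runs:
            sign = 1
            for i in subset:
                sign *= run[i]
            q += sign * responses[run]
        q /= n
        name = "*".join(factors[i] for i in subset)
        sum_of_squares[name] = n * q * q

    total = sum(sum_of_squares.values())  # total variation (SST)
    return {name: ss / total for name, ss in sum_of_squares.items()}


# Hypothetical 2^2 example: factor A and factor B, made-up responses.
example = {(-1, -1): 15, (1, -1): 45, (-1, 1): 25, (1, 1): 75}
for effect, share in allocation_of_variation(["A", "B"], example).items():
    print(f"{effect}: {share:.1%}")
```

Shares that do not sum to 100% across the main effects alone, as with the four figures quoted in the abstract, would in this kind of analysis correspond to interaction terms; whether that is how the paper accounts for the remainder is an assumption.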

Renato J. O. Figueiredo | José A. B. Fortes