Impact of heterogeneity on DSM performance

This paper explores area/parallelism tradeoffs in the design of distributed shared-memory (DSM) multiprocessors built out of large single-chip computing nodes. In this context, area-efficiency arguments motivate a heterogeneous organization consisting of a few nodes with large caches designed for single-thread parallelism and a larger number of nodes with smaller caches designed for multi-thread parallelism. The quantitative performance of such an organization is reported for a set of homogeneous multiprocessor programs from the SPLASH-2 benchmark suite. These programs are mapped onto the heterogeneous processors without source code modifications via static thread assignment policies. Simulation-based analysis is used to compare the performance of heterogeneous and homogeneous DSMs that occupy the same silicon area. The analysis shows that a 4-node heterogeneous DSM with 21 processors outperforms its homogeneous counterpart with 4 processors by an average of 36% for the studied multiprocessor workload, while delivering the same performance for sequential codes. A sensitivity analysis based on a factorial design experiment is used to study the implications of processor, memory, and network heterogeneity for the overall cost and performance of a heterogeneous DSM. On average, the studied benchmarks are affected primarily by heterogeneity in processor performance (59.3%), followed by cache sizes (18.2%), memory latency (14.6%), and network latency (5.6%).
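
The percentages attributed to each source of heterogeneity are the kind of result a two-level factorial design produces through allocation of variation. The following is a minimal sketch of that general technique, not the paper's actual procedure: the function name `allocate_variation`, the factor names, and the response values are hypothetical and chosen only for illustration.

```python
from itertools import combinations, product
from math import prod


def allocate_variation(factors, responses):
    """Fraction of response variation explained by each effect in a 2^k design.

    factors:   list of factor names (hypothetical labels).
    responses: dict mapping a tuple of levels (-1 or +1 per factor)
               to the measured response for that run.
    """
    k = len(factors)
    runs = list(product((-1, 1), repeat=k))
    n = len(runs)  # 2^k runs, one per level combination

    effects = {}
    # Main effects (order 1) and all interactions (order 2..k).
    for order in range(1, k + 1):
        for subset in combinations(range(k), order):
            name = "*".join(factors[i] for i in subset)
            # Sign-table method: the sign of a run for an interaction is
            # the product of the levels of the factors involved.
            q = sum(prod(run[i] for i in subset) * responses[run]
                    for run in runs) / n
            effects[name] = q

    # Total variation is proportional to the sum of squared effects.
    total = sum(q * q for q in effects.values())
    return {name: (q * q / total if total else 0.0)
            for name, q in effects.items()}


# Illustrative (made-up) responses for two factors at low (-1) / high (+1).
example = {
    (-1, -1): 15.0,
    (1, -1): 45.0,
    (-1, 1): 25.0,
    (1, 1): 75.0,
}
shares = allocate_variation(["cpu", "cache"], example)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these made-up responses, the first factor explains about 76.2% of the variation, the second about 19.0%, and their interaction about 4.8%. The same calculation extends to more factors, with one run per combination of levels.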
