Shared-memory multiprocessing: Current state and future directions

Abstract. Progress in shared-memory multiprocessing research over the last several decades has led to its industrial recognition as a key technology for a variety of performance-demanding application domains. In this chapter, we summarize the current state of this technology, including system architectures, programming interfaces, and the compiler and tool technology offered to the application writer. We then identify important issues for future research in relation to technology and application trends. We focus particularly on research directions in machine architectures, programming interfaces, and parallelization methodologies.
