Paper information - Parallel computer architecture - a hardware / software approach

Parallel computer architecture - a hardware / software approach

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.

* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents

1 Introduction
2 Parallel Programs
3 Programming for Performance
4 Workload-Driven Evaluation
5 Shared Memory Multiprocessors
6 Snoop-based Multiprocessor Design
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs
10 Interconnection Network Design
11 Latency Tolerance
12 Future Directions
APPENDIX A Parallel Benchmark Suites

[1] Eric A. Brewer,et al. Remote queues: exposing message queues for optimization and atomicity , 1995, SPAA '95.

[2] Wen-Hann Wang,et al. On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.

[3] Michael Stumm,et al. Cache consistency in hierarchical-ring-based multiprocessors , 1992, Proceedings Supercomputing '92.

[4] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.

[5] Michael L. Scott,et al. Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[6] Calvin K. Tang. Cache system design in the tightly coupled multiprocessor system , 1976, AFIPS '76.

[7] Marc Snir,et al. The Performance of Multistage Interconnection Networks for Multiprocessors , 1983, IEEE Transactions on Computers.

[8] James R. Goodman. Using cache memory to reduce processor-memory traffic , 1998, ISCA '98.

[9] R. S. Nikhil. Can dataflow subsume von Neumann computing? , 1989, ISCA '89.

[10] Anant Agarwal,et al. APRIL: a processor architecture for multiprocessing , 1990, ISCA '90.

[11] William J. Dally,et al. The J-machine network , 1992, Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors.

[12] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[13] Vaidy S. Sunderam,et al. PVM: A Framework for Parallel Distributed Computing , 1990, Concurr. Pract. Exp..

[14] Josep Torrellas,et al. False Sharing and Spatial Locality in Multiprocessor Caches , 1994, IEEE Trans. Computers.

[15] Alan L. Cox,et al. Lazy release consistency for software distributed shared memory , 1992, ISCA '92.

[16] Charles E. Leiserson,et al. Randomized Routing on Fat-Trees , 1989, Adv. Comput. Res..

[17] Jean-Loup Baer,et al. A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.

[18] Jonathan S. Turner,et al. Design of a broadcast packet switching network , 1988, IEEE Trans. Commun..

[19] Anant Agarwal,et al. Anatomy of a message in the Alewife multiprocessor , 1993 .

[20] F. Leighton,et al. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes , 1991 .

[21] Richard P. Martin,et al. HPAM: an active message layer for a network of hp workstations , 1994, Symposium Record Hot Interconnects II.

[22] V. Benes,et al. Mathematical Theory of Connecting Networks and Telephone Traffic. , 1966 .

[23] William J. Dally,et al. The message-driven processor: a multicomputer processing node with efficient mechanisms , 1992, IEEE Micro.

[24] Peter M. Kogge,et al. EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.

[25] Michael J. Flynn,et al. Latency Tolerance for Dynamic Processors , 1996 .

[26] John S. Keen,et al. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks , 1997, ACM/IEEE SC 1997 Conference (SC'97).

[27] Jaswinder Pal Singh,et al. A methodology and an evaluation of the SGI Origin2000 , 1998, SIGMETRICS '98/PERFORMANCE '98.

[28] Luiz André Barroso,et al. The performance of cache-coherent ring-based multiprocessors , 1993, ISCA '93.

[29] Alan L. Cox,et al. Evaluation of release consistent software distributed shared memory on emerging network technology , 1993, ISCA '93.

[30] Kenneth E. Batcher. STARAN parallel processor system hardware , 1974, AFIPS '74.

[31] Jack B. Dennis,et al. Data Flow Supercomputers , 1980, Computer.

[32] Monica S. Lam,et al. Jade: a high-level, machine-independent language for parallel programming , 1993, Computer.

[33] Michael L. Scott,et al. Using memory-mapped network interfaces to improve the performance of distributed shared memory , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[34] Dirk Roose,et al. Benchmarking the iPSC/2 Hypercube Multiprocessor , 1989, Concurr. Pract. Exp..

[35] Gene H. Golub,et al. Matrix computations (3rd ed.) , 1996 .

[36] Dean M. Tullsen,et al. Limitations of cache prefetching on a bus-based multiprocessor , 1993, ISCA '93.

[37] Report,et al. Public International Benchmarks for Parallel Computers , 1993 .

[38] Richard J. Swan,et al. The implementation of the Cm* multi-microprocessor , 1977, AFIPS '77.

[39] James Cownie,et al. Message Passing on the Meiko CS-2 , 1994, Parallel Comput..

[40] K. Mani Chandy,et al. Parallel program design - a foundation , 1988 .

[41] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[42] Moriyoshi Ohara,et al. Producer-oriented versus consumer-oriented prefetching: a comparison and analysis of parallel application programs , 1996 .

[43] Anant Agarwal,et al. Limits on Interconnection Network Performance , 1991, IEEE Trans. Parallel Distributed Syst..

[44] Stephen R. Goldschmidt,et al. Simulation of multiprocessors: accuracy and performance , 1993 .

[45] Dean M. Tullsen,et al. Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[46] D. Lenoski,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[47] Henry Fuchs,et al. Near real-time shaded display of rigid objects , 1983, SIGGRAPH.

[48] Sarita V. Adve,et al. An evaluation of memory consistency models for shared-memory systems with ILP processors , 1996, ASPLOS VII.

[49] Alan Jay Smith,et al. Analysis of benchmark characteristics and benchmark performance prediction , 1996, TOCS.

[50] S.K. Reinhardt,et al. Decoupled Hardware Support for Distributed Shared Memory , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[51] David E. Culler,et al. Active Message Applications Programming Interface , 1996 .

[52] David A. Wood,et al. An in-cache address translation mechanism , 1986, ISCA '86.

[53] Anoop Gupta,et al. The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[54] C. G. Bell. Multis: A New Class of Multiprocessor Computers , 1985, Science.

[55] S. Konstantinidou,et al. Chaos router: architecture and performance , 1991, [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture.

[56] John B. Carter,et al. An argument for simple COMA , 1995, Future Gener. Comput. Syst..

[57] Susan J. Eggers,et al. Eliminating False Sharing , 1991, ICPP.

[58] David P. Rodgers,et al. Improvements in multiprocessor system design , 1985, ISCA '85.

[59] Burton J. Smith,et al. The Horizon supercomputing system: architecture and software , 1988, Proceedings. SUPERCOMPUTING '88.

[60] Fredrik Dahlgren. Boosting the performance of hybrid snooping cache protocols , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[61] Monica S. Lam,et al. The design and evaluation of a shared object system for distributed memory machines , 1994, OSDI '94.

[62] John L. Hennessy,et al. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors , 1995 .

[63] William J. Dally,et al. The Named-State Register File: implementation and performance , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[64] E. Biagioni,et al. Designing a practical ATM LAN , 1993, IEEE Network.

[65] Kai Li,et al. Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1998, ISCA '98.

[66] Stefanos Kaxiras,et al. Kiloprocessor Extensions to SCI , 1996, Proceedings of International Conference on Parallel Processing.

[67] Charles E. Leiserson,et al. Fat-trees: Universal networks for hardware-efficient supercomputing , 1985, IEEE Transactions on Computers.

[68] Jean-Loup Baer,et al. An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[69] Truman Joe. COMA-F: a non-hierarchical cache only memory architecture , 1995 .

[70] Harold S. Stone,et al. A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[71] John L. Hennessy,et al. Evaluating the memory overhead required for COMA architectures , 1994, ISCA '94.

[72] Peter J. Denning,et al. The working set model for program behavior , 1968, CACM.

[73] Michael D. Noakes,et al. The J-machine multicomputer: an architectural evaluation , 1993, ISCA '93.

[74] Anoop Gupta,et al. Parallel Visualization Algorithms , 1994 .

[75] Yoichi Koyanagi,et al. AP1000+: architectural support of PUT/GET interface for parallelizing compiler , 1994, ASPLOS VI.

[76] Anoop Gupta,et al. Two Techniques to Enhance the Performance of Memory Consistency Models , 1991, ICPP.

[77] R. E. Kessler,et al. Cray T3D: a new dimension for Cray Research , 1993, Digest of Papers. Compcon Spring.

[78] Robert J. Fowler,et al. Adaptive cache coherency for detecting migratory shared data , 1993, ISCA '93.

[79] Christopher C. Hsiung,et al. Cray X-MP: the birth of a supercomputer , 1989, Computer.

[80] Nancy P. Kronenberg,et al. VAXcluster: a closely-coupled distributed system , 1986, TOCS.

[81] Eric A. Brewer,et al. Scalable expanders: exploiting hierarchical random wiring , 1994, STOC '94.

[82] Michel Dubois,et al. Correct memory operation of cache-based multiprocessors , 1987, ISCA '87.

[83] Nitin D. Godiwala,et al. The Second-generation Processor Module for AlphaServer 2100 Systems , 1995, Digit. Tech. J..

[84] J. Y. Ngai,et al. A framework for adaptive routing in multicomputer networks , 1989, CARN.

[85] David J. Schanin. The design and development of a very high speed system bus - the Encore Multimax Nanobus , 1986 .

[86] Alan Jay Smith,et al. A class of compatible cache consistency protocols and their support by the IEEE futurebus , 1986, ISCA '86.

[87] Allan Porterfield,et al. The Tera computer system , 1990 .

[88] H. B. Bakoglu,et al. Circuits, interconnections, and packaging for VLSI , 1990 .

[89] Anoop Gupta,et al. Scaling parallel programs for multiprocessors: methodology and examples , 1993, Computer.

[90] Greg J. Regnier,et al. The Virtual Interface Architecture , 2002, IEEE Micro.

[91] Richard Kaufmann,et al. Using the Memory Channel Network , 1997, IEEE Micro.

[92] R. Gillett,et al. Overview of memory channel network for PCI , 1996, COMPCON '96. Technologies for the Information Superhighway Digest of Papers.

[93] C. A. R. Hoare,et al. Communicating Sequential Processes (Reprint) , 1983, Commun. ACM.

[94] Loren Schwiebert,et al. A universal proof technique for deadlock-free routing in interconnection networks , 1995, SPAA '95.

[95] Alan L. Cox,et al. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[96] Samuel P. Morgan,et al. Input Versus Output Queueing on a Space-Division Packet Switch , 1987, IEEE Trans. Commun..

[97] Andrew W. Wilson,et al. Hierarchical cache/bus architecture for shared memory multiprocessors , 1987, ISCA '87.

[98] W. Daniel Hillis,et al. The connection machine , 1985 .

[99] Michael D. Smith,et al. Limits on multiple instruction issue , 1989, ASPLOS 1989.

[100] Robert J. Harrison,et al. Performance and experience with LAPI-a new high-performance communication library for the IBM RS/6000 SP , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[101] Mark D. Hill,et al. A Unified Formalization of Four Shared-Memory Models , 1993, IEEE Trans. Parallel Distributed Syst..

[102] H. T. Kung,et al. The design of nectar: a network backplane for heterogeneous multicomputers , 1989, ASPLOS 1989.

[103] Guy E. Blelloch,et al. A comparison of sorting algorithms for the connection machine CM-2 , 1991, SPAA '91.

[104] Maurice Herlihy,et al. Impossibility and universality results for wait-free synchronization , 1988, PODC '88.

[105] L. Hernquist,et al. Performance characteristics of tree codes , 1987 .

[106] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[107] Joshua E. Barnes,et al. Error Analysis of a Tree Code , 1989 .

[108] Quinn Snell,et al. HINT: A new way to measure computer performance , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.

[109] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[110] Kai Li,et al. Two virtual memory mapped network interface designs , 1994, Symposium Record Hot Interconnects II.

[111] Anoop Gupta,et al. Hiding memory latency using dynamic scheduling in shared-memory multiprocessors , 1992, ISCA '92.

[112] Daniel L. Slotnick,et al. The SOLOMON computer , 1962, AFIPS '62 (Fall).

[113] Shreekant S. Thakkar,et al. Synchronization algorithms for shared-memory multiprocessors , 1990, Computer.

[114] Edsger W. Dijkstra,et al. Solution of a problem in concurrent programming control , 1965, CACM.

[115] Beng-Hong Lim,et al. Reactive synchronization algorithms for multiprocessors , 1994, ASPLOS VI.

[116] Stefanos Kaxiras,et al. The GLOW cache coherence protocol extensions for widely shared data , 1996, ICS '96.

[117] William Gropp,et al. Using MPI: portable parallel programming with the message-passing interface , 1994 .

[118] Mike Johnson,et al. Superscalar microprocessor design , 1991, Prentice Hall series in innovative technology.

[119] Kunle Olukotun,et al. The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[120] Anoop Gupta,et al. Programming for Different Memory Consistency Models , 1992, J. Parallel Distributed Comput..

[121] Michel Cekleov,et al. Formal Specification of Memory Models , 1992 .

[122] Paul Feautrier,et al. A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[123] G. C. Fox,et al. Solving Problems on Concurrent Processors , 1988 .

[124] Anoop Gupta,et al. Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[125] J. L. Hennessy,et al. An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors , 1993, Supercomputing '93.

[126] Seth Copen Goldstein,et al. Evaluation of mechanisms for fine-grained parallel programs in the J-machine and the CM-5 , 1993, ISCA '93.

[127] Remzi H. Arpaci-Dusseau,et al. Empirical evaluation of the CRAY-T3D: a compiler perspective , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[128] Sarita V. Adve,et al. An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[129] James R. Larus,et al. Fine-grain access control for distributed shared memory , 1994, ASPLOS VI.

[130] Norman P. Jouppi,et al. Available instruction-level parallelism for superscalar and superpipelined machines , 1989, ASPLOS 1989.

[131] Jonathan M. Smith,et al. A high-performance host interface for ATM networks , 1991, SIGCOMM 1991.

[132] William A. Wulf,et al. Overview of the Hydra Operating System development , 1975, SOSP.

[133] Faye A. Briggs,et al. The floating point performance of a superscalar SPARC processor , 1991, ASPLOS IV.

[134] D.A. Wood,et al. Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[135] David E. Culler,et al. A case for NOW (networks of workstation) , 1995, PODC '95.

[136] Anoop Gupta,et al. The Stanford FLASH Multiprocessor , 1994, ISCA.

[137] Richard B. Gillett. Memory Channel Network for PCI , 1996, IEEE Micro.

[138] Paul Hudak,et al. Memory coherence in shared virtual memory systems , 1989, TOCS.

[139] Randy H. Katz,et al. The effect of sharing on the cache and bus performance of parallel programs , 1989, ASPLOS 1989.

[140] Vijay S. Pai,et al. The Interaction Of Software Prefetching With Ilp Processors In Shared-memory Systems , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[141] D. Burger,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[142] Michel Dubois,et al. Combined performance gains of simple cache protocol extensions , 1994, ISCA '94.

[143] Scott A. Mahlke,et al. IMPACT: an architectural framework for multiple-instruction-issue processors , 1991, ISCA '91.

[144] John D. Valois. Lock-free linked lists using compare-and-swap , 1995, PODC '95.

[145] Charles E. Leiserson,et al. How to assemble tree machines (Extended Abstract) , 1982, STOC '82.

[146] Anoop Gupta,et al. The DASH prototype: implementation and performance , 1992, ISCA '92.

[147] R. M. Tomasulo,et al. An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[148] T. Lovett,et al. STiNG: A CC-NUMA Computer System for the Commercial Marketplace , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[149] Liviu Iftode,et al. Scope consistency: a bridge between release consistency and entry consistency , 1996, SPAA '96.

[150] Michael Shebanow,et al. Single instruction stream parallelism is greater than two , 1991, ISCA '91.

[151] Anoop Gupta,et al. SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[152] Maurice Herlihy,et al. Axioms for concurrent objects , 1987, POPL '87.

[153] Calton Pu,et al. A Lock-Free Multiprocessor OS Kernel , 1992, OPSR.

[154] Donald Yeung,et al. The MIT Alewife machine: architecture and performance , 1995, ISCA '98.

[155] John L. Hennessy,et al. SoftFLASH: analyzing the performance of clustered distributed virtual shared memory , 1996, ASPLOS VII.

[156] Richard M. Russell,et al. The CRAY-1 computer system , 1978, CACM.

[157] Charles R. Vick,et al. PEPE architecture - present and future , 1978, AFIPS National Computer Conference.

[158] T. A. Jeeves,et al. On the use of the SOLOMON parallel-processing computer , 1962, AFIPS '62 (Fall).

[159] Robert W. Horst. TNet: A Reliable System Area Network , 1995, IEEE Micro.

[160] James R. Larus,et al. Tempest and typhoon: user-level shared memory , 1994, ISCA '94.

[161] S. F. Reddaway. DAP—a distributed array processor , 1973, ISCA 1973.

[162] Katherine A. Yelick,et al. Analyses and Optimizations for Shared Address Space Programs , 1996, J. Parallel Distributed Comput..

[163] Chris J. Scheiman,et al. Experience with active messages on the Meiko CS-2 , 1995, Proceedings of 9th International Parallel Processing Symposium.

[164] Michel Dubois,et al. Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[165] Donald E. Knuth,et al. Additional comments on a problem in concurrent programming control , 1966, CACM.

[166] Anant Agarwal,et al. LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[167] Maurice Herlihy,et al. A methodology for implementing highly concurrent data objects , 1993, TOPL.

[168] Jack J. Dongarra,et al. Performance of various computers using standard linear equations software in a FORTRAN environment , 1988, CARN.

[169] James R. Larus,et al. Mechanisms for cooperative shared memory , 1993, ISCA '93.

[170] Charles L. Seitz,et al. Concurrent VLSI Architectures , 1984, IEEE Transactions on Computers.

[171] David H. Bailey,et al. FFTs in external or hierarchical memory , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[172] Anoop Gupta,et al. Integration of message passing and shared memory in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.

[173] Shuichi Sakai,et al. Prototype implementation of a highly parallel dataflow machine EM-4 , 1991, [1991] Proceedings. The Fifth International Parallel Processing Symposium.

[174] Geoffrey C. Fox,et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers , 1989, Int. J. High Perform. Comput. Appl..

[175] Andris Padegs. System/360 and Beyond , 1981, IBM J. Res. Dev..

[176] Janak H. Patel,et al. Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.

[177] Kourosh Gharachorloo,et al. Shasta: a low overhead, software-only approach for supporting fine-grain shared memory , 1996, ASPLOS VII.

[178] Charles L. Seitz,et al. Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.

[179] Brian N. Bershad,et al. Software write detection for a distributed shared memory , 1994, OSDI '94.

[180] Andrea C. Arpaci-Dusseau,et al. Fast Parallel Sorting Under LogP: Experience with the CM-5 , 1996, IEEE Trans. Parallel Distributed Syst..

[181] Janak H. Patel,et al. Stride directed prefetching in scalar processors , 1992, MICRO 1992.

[182] Robert W. Horst,et al. An architecture for high volume transaction processing , 1985, ISCA '85.

[183] Kenichi Hayashi,et al. Improving AP1000 parallel computer performance with message communication , 1993, ISCA '93.

[184] Alan Jay Smith,et al. Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[185] Anthony J. G. Hey,et al. The Genesis distributed memory benchmarks , 1991, Parallel Comput..

[186] Mark D. Hill,et al. Weak ordering—a new definition , 1998, ISCA '98.

[187] J. E. Thornton,et al. Parallel operation in the control data 6600 , 1964, AFIPS '64 (Fall, part II).

[188] Y. Fujita,et al. A 7.68 GIPS 3.84 GB/s 1W parallel image processing RAM integrating a 16 Mb DRAM and 128 processors , 1996, 1996 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, ISSCC.

[189] Anoop Gupta,et al. Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity , 1995, J. Parallel Distributed Comput..

[190] Jack J. Dongarra,et al. Software Libraries for Linear Algebra Computations on High Performance Computers , 1995, SIAM Rev..

[191] James R. Goodman,et al. Performance of Pruning-Cache Directories for Large-Scale Multiprocessors , 1993, IEEE Trans. Parallel Distributed Syst..

[192] M. J. Carlton,et al. Micro benchmark analysis of the KSR1 , 1993, Supercomputing '93.

[193] Srinivasan Parthasarathy,et al. Cashmere-2L: software coherent shared memory on a clustered remote-write network , 1997, SOSP.

[194] Liviu Iftode,et al. Evaluation of hardware write propagation support for next-generation shared virtual memory clusters , 1998, ICS '98.

[195] Christos H. Papadimitriou,et al. The serializability of concurrent database updates , 1979, JACM.

[196] Lionel M. Ni,et al. The turn model for adaptive routing , 1992, ISCA '92.

[197] Willy Zwaenepoel,et al. Implementation and performance of Munin , 1991, SOSP '91.

[198] Manoj Kumar,et al. Unique design concepts in GF11 and their impact on performance , 1992, IBM J. Res. Dev..

[199] Richard M. Karp,et al. An optimal algorithm for on-line bipartite matching , 1990, STOC '90.

[200] K. Olukotun,et al. Evaluation of Design Alternatives for a Multiprocessor Microprocessor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[201] J. M. Barton,et al. Translation Lookaside Buffer Synchronization in a Multiprocessor System , 1988, USENIX Winter.

[202] Jean-Loup Baer,et al. Reducing memory latency via non-blocking and prefetching caches , 1992, ASPLOS V.

[203] Michel Cekleov,et al. XDBus: a high-performance, consistent, packet-switched VLSI bus , 1993, Digest of Papers. Compcon Spring.

[204] Monica S. Lam,et al. Limits of control flow on parallelism , 1992, ISCA '92.

[205] A. Malony,et al. Implementing a parallel C++ runtime system for scalable parallel systems , 1993, Supercomputing '93.

[206] Jack Dongarra,et al. Computer benchmarking: paths and pitfalls , 1987 .

[207] Arvind,et al. T: a multithreaded massively parallel architecture , 1992, ISCA '92.

[208] James R. Larus,et al. Application-specific protocols for user-level shared memory , 1994, Proceedings of Supercomputing '94.

[209] W. H. Wang,et al. Organization and performance of a two-level virtual-real cache hierarchy , 1989, ISCA '89.

[210] Willy Zwaenepoel,et al. Techniques for reducing consistency-related communication in distributed shared-memory systems , 1995, TOCS.

[211] James K. Archibald,et al. Cache coherence protocols: evaluation using a multiprocessor simulation model , 1986, TOCS.

[212] Steven Fortune,et al. Parallelism in random access machines , 1978, STOC.

[213] Robert W. Horst,et al. A flexible ServerNet-based fault-tolerant architecture , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[214] Al Geist,et al. Network-based concurrent computing on the PVM system , 1992, Concurr. Pract. Exp..

[215] Liviu Iftode,et al. Software support for virtual memory-mapped communication , 1996, Proceedings of International Conference on Parallel Processing.

[216] Burton J. Smith. Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[217] Michel Dubois,et al. Delayed consistency and its effects on the miss rate of parallel programs , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[218] Anoop Gupta,et al. Memory-reference characteristics of multiprocessor applications under MACH , 1988, SIGMETRICS 1988.

[219] Håkan Grahn,et al. Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection , 1996, J. Parallel Distributed Comput..

[220] Michael J. Flynn,et al. Reducing Cache Miss Rates Using Prediction Caches , 1996 .

[221] Michael Burrows,et al. Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links , 1991, IEEE J. Sel. Areas Commun..

[222] Jack L. Lo,et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[223] David A. Patterson,et al. Computer architecture (2nd ed.): a quantitative approach , 1996 .

[224] William J. Dally,et al. Performance Analysis of k-Ary n-Cube Interconnection Networks , 1987, IEEE Trans. Computers.

[225] Robert W. Horst,et al. Multiple instruction issue in the NonStop cyclone processor , 1990, ISCA '90.

[226] Maya Gokhale,et al. Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[227] Vipin Kumar,et al. Analysis of scalability of parallel algorithms and architectures: a survey , 1991, ICS '91.

[228] R. H. Katz,et al. Evaluating the performance of four snooping cache coherency protocols , 1989, ISCA '89.

[229] David E. Culler,et al. Monsoon: an explicit token-store architecture , 1998, ISCA '98.

[230] Lawrence C. Stewart,et al. Firefly: a multiprocessor workstation , 1987, ASPLOS 1987.

[231] Maurice Herlihy,et al. Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[232] Anoop Gupta,et al. Comparative performance evaluation of cache-coherent NUMA and COMA architectures , 1992, ISCA '92.

[233] David H. Bailey. Misleading Performance Reporting in the Supercomputing Field , 1992, Sci. Program..

[234] Samuel H. Fuller,et al. Cm*: a modular, multi-microprocessor , 1977, AFIPS '77.

[235] Michel Dubois,et al. Memory Access Dependencies in Shared-Memory Multiprocessors , 1990, IEEE Trans. Software Eng..

[236] Charles L. Seitz,et al. The cosmic cube , 1985, CACM.

[237] P. Pierce,et al. The Paragon implementation of the NX message passing interface , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[238] William J. Dally. Virtual-channel flow control , 1990, ISCA '90.

[239] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.

[240] David E. Culler,et al. Virtual network transport protocols for Myrinet , 1998, IEEE Micro.

[241] David E. Culler,et al. Analysis of multithreaded architectures for parallel computing , 1990, SPAA '90.

[242] Elliot Nestle,et al. The SYNAPSE N+1 System: architectural characteristics and performance data of a tightly-coupled multiprocessor system , 1985, ISCA '85.

[243] George Karypis,et al. Introduction to Parallel Computing , 1994 .

[244] Corinna Lee. Multistep Gradual Rounding , 1989, IEEE Trans. Computers.

[245] James R. Larus,et al. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.

[246] Fong Pong,et al. Missing the Memory Wall: The Case for Processor/Memory Integration , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[247] Maurice Herlihy,et al. Wait-free synchronization , 1991, TOPL.

[248] David L. Black,et al. Translation lookaside buffer consistency: a software approach , 1989, ASPLOS 1989.

[249] Kenji Nishida,et al. An Architecture of a Data Flow Machine and Its Evaluation , 1984, COMPCON.

[250] Allan Gottlieb,et al. Highly parallel computing , 1989, Benjamin/Cummings Series in computer science and engineering.

[251] Mark Horowitz,et al. Performance tradeoffs in cache design , 1988, ISCA '88.

[252] Michael Stumm,et al. Hector: a hierarchically structured shared-memory multiprocessor , 1991, Computer.

[253] Guy L. Steele,et al. The High Performance Fortran Handbook , 1993 .

[254] Seth Copen Goldstein,et al. Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.

[255] Richard L. Sites,et al. Alpha Architecture Reference Manual , 1995 .

[256] Ashok Singhal,et al. The next-generation SPARC multiprocessing system architecture , 1993, Digest of Papers. Compcon Spring.

[257] Thomas E. Anderson,et al. High speed switch scheduling for local area networks , 1992, ASPLOS V.

[258] Bryan S. Rosenburg. Low-synchronization translation lookaside buffer consistency in large-scale shared-memory multiprocessors , 1989, SOSP '89.

[259] Jack J. Dongarra,et al. The PVM Concurrent Computing System: Evolution, Experiences, and Trends , 1994, Parallel Comput.

[260] Anoop Gupta,et al. The DASH Prototype: Logic Overhead and Performance , 1993, IEEE Trans. Parallel Distributed Syst.

[261] Todd C. Mowry,et al. Tolerating latency through software-controlled data prefetching , 1994 .

[262] W. Daniel Hillis,et al. Data parallel algorithms , 1986, CACM.

[263] Jack Dongarra,et al. MPI: The Complete Reference , 1996 .

[264] David Banks,et al. A High-Performance Network Architecture for a PA-RISC Workstation , 1993, IEEE J. Sel. Areas Commun.

[265] James R. Goodman,et al. The Impact of Pipelined Channels on k-ary n-Cube Networks , 1994, IEEE Trans. Parallel Distributed Syst.

[266] A. Richard Newton,et al. An empirical evaluation of two memory-efficient directory methods , 1990, ISCA '90.

[267] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.

[268] Katherine A. Yelick,et al. Optimizing parallel programs with explicit synchronization , 1995, PLDI '95.

[269] Thomas J. LeBlanc,et al. Adjustable block size coherent caches , 1992, ISCA '92.

[270] James R. Goodman,et al. Cache Consistency and Sequential Consistency , 1991 .

[271] Alan Jay Smith,et al. Cache Memories , 1982, CSUR.

[272] Anoop Gupta,et al. Complete computer system simulation: the SimOS approach , 1995, IEEE Parallel Distributed Technol. Syst. Appl.

[273] Jim Savage,et al. Parallel processing as a language design problem , 1985, ISCA '85.

[274] Charles L. Seitz,et al. Multicomputers: message-passing concurrent computers , 1988, Computer.

[275] P. R. Cappello,et al. Implementing the Beam and Warming method on the hypercube , 1989, C3P.

[276] Anoop Gupta,et al. Working sets, cache sizes, and node granularity issues for large-scale multiprocessors , 1993, ISCA '93.

[277] Livio Ricciulli,et al. The detection and elimination of useless misses in multiprocessors , 1993, ISCA '93.

[278] Yale Patt,et al. Exploiting fine-grained parallelism through a combination of hardware and software techniques , 1991, ISCA '91.

[279] Michael S. Warren,et al. Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems , 1994, Int. J. High Perform. Comput. Appl.

[280] S.-Y.R. Li. Theory of periodic contention and its application to packet switching , 1988, IEEE INFOCOM '88, Seventh Annual Joint Conference of the IEEE Computer and Communications Societies. Networks: Evolution or Revolution?

[281] Edsger W. Dijkstra,et al. Termination Detection for Diffusing Computations , 1980, Inf. Process. Lett.

[282] Kai Li,et al. Understanding Application Performance on Shared Virtual Memory Systems , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[283] Michel Dubois,et al. Memory access buffering in multiprocessors , 1998, ISCA '98.

[284] S. G. Tucker,et al. The IBM 3090 System: An Overview , 1986, IBM Syst. J.

[285] B. Delagi,et al. Distributed-directory scheme: Stanford distributed-directory protocol , 1990, Computer.

[286] Eric A. Brewer,et al. How to get good performance from the CM-5 data network , 1994, Proceedings of 8th International Parallel Processing Symposium.

[287] Eric Williams,et al. Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[288] K. Gunther,et al. Prevention of Deadlocks in Packet-Switched Data Transport Systems , 1981 .

[289] Liviu Iftode,et al. Improving release-consistent shared virtual memory using automatic update , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[290] David E. Culler,et al. Two Fundamental Limits on Dataflow Multiprocessing , 1993, Architectures and Compilation Techniques for Fine and Medium Grain Parallelism.

[291] H. T. Kung,et al. Supporting systolic and memory communication in iWarp , 1990, ISCA '90.

[292] David E. Culler,et al. Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine , 1991, ASPLOS IV.

[293] James H. Patterson,et al. Portable Programs for Parallel Processors , 1987 .

[294] Ronald Minnich,et al. The memory-integrated network interface , 1995, IEEE Micro.

[295] Allan Gottlieb,et al. Complexity Results for Permuting Data and Other Computations on Parallel Processors , 1984, JACM.

[296] Patricia J. Teller. Translation-lookaside buffer consistency , 1990, Computer.

[297] Brian N. Bershad,et al. The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[298] Andrew Wilson,et al. Shared memory multiprocessors: the right approach to parallel processing , 1989, Digest of Papers. COMPCON Spring 89. Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage.

[299] William J. Dally,et al. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks , 1987, IEEE Transactions on Computers.

[300] Steven L. Scott,et al. Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.

[301] Maged M. Michael,et al. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms , 1996, PODC '96.

[302] W. Daniel Hillis,et al. The CM-5 Connection Machine: a scalable supercomputer , 1993, CACM.

[303] Anoop Gupta,et al. Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.

[304] Dean M. Tullsen,et al. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading , 1997, TOCS.

[305] Sarita V. Adve,et al. Shared Memory Consistency Models: A Tutorial , 1996, Computer.

[306] Gregory F. Pfister,et al. “Hot spot” contention and combining in multistage interconnection networks , 1985, IEEE Transactions on Computers.

[307] T. von Eicken,et al. Parallel programming in Split-C , 1993, Supercomputing '93.

[308] Monica S. Lam,et al. Global optimizations for parallelism and locality on scalable parallel machines , 1993, PLDI '93.

[309] John K. Salmon,et al. Parallel hierarchical N-body methods , 1992 .

[310] Mary K. Vernon,et al. Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS 1989.

[311] F. Baskett,et al. The 4D-MP graphics superworkstation: computing+graphics=40 MIPS+MFLOPS and 100000 lighted polygons per second , 1988, Digest of Papers. COMPCON Spring 88 Thirty-Third IEEE Computer Society International Conference.

[312] Anoop Gupta,et al. Modeling communication in parallel algorithms: a fruitful interaction between theory and systems? , 1994, SPAA '94.

[313] Dennis Shasha,et al. Efficient and correct execution of parallel programs that share memory , 1988, TOPLAS.

[314] Mark D. Hill,et al. Implementing Sequential Consistency in Cache-Based Systems , 1990, ICPP.

[315] Todd C. Mowry,et al. Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[316] MPI Forum. MPI: A Message-Passing Interface , 1994 .

[317] James L. Flanagan,et al. Technologies for multimedia communications , 1994, Proc. IEEE.

[318] Mosur Ravishankar,et al. PLUS: a distributed shared-memory system , 1990, ISCA '90.

[319] Jaswinder Pal Singh,et al. Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors , 1997, PPOPP '97.

[320] Michael J. Flynn,et al. Some Computer Organizations and Their Effectiveness , 1972, IEEE Transactions on Computers.

[321] Alexander Aiken,et al. Optimal loop parallelization , 1988, PLDI '88.

[322] David R. Cheriton,et al. The synergy between non-blocking synchronization and operating system structure , 1996, OSDI '96.

[323] Pat Hanrahan,et al. A rapid hierarchical radiosity algorithm , 1991, SIGGRAPH.

[324] James P. Anderson,et al. D825 - a multiple-computer system for command & control , 1962, AFIPS '62 (Fall).

[325] Masahiro Yoshida,et al. Development and achievement of NAL Numerical Wind Tunnel (NWT) for CFD computations , 1994, Proceedings of Supercomputing '94.

[326] Seth Copen Goldstein,et al. NIFDY: a low overhead, high throughput network interface , 1995, ISCA.

[327] James P. Laudon,et al. Architectural and Implementation Tradeoffs for Multiple-Context Processors , 1995 .

[328] Christopher F. Joerg,et al. The Monsoon interconnection network , 1991, [1991 Proceedings] IEEE International Conference on Computer Design: VLSI in Computers and Processors.

[329] Leonard Kleinrock,et al. Virtual Cut-Through: A New Computer Communication Switching Technique , 1979, Comput. Networks.

[330] Bruce S. Davie,et al. Computer Networks: A Systems Approach , 1996 .

[331] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.

[332] Jack Dongarra,et al. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers , 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.

[333] Anoop Gupta,et al. Comparative evaluation of latency reducing and tolerating techniques , 1991, ISCA '91.

[334] James V. Lawton,et al. Building a High-performance Message-passing System for MEMORY CHANNEL Clusters , 1996, Digit. Tech. J.

[335] Randall Rettberg,et al. Contention is no obstacle to shared-memory multiprocessing , 1986, CACM.

[336] Peter S. Pacheco. Parallel programming with MPI , 1996 .

[337] Anoop Gupta,et al. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[338] John R. Nickolls,et al. The design of the MasPar MP-1: a cost effective massively parallel computer , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.

[339] Daniel Shawcross Wilkerson,et al. System area network mapping , 1997, SPAA '97.

[340] Kenneth E. Batcher,et al. Design of a Massively Parallel Processor , 1980, IEEE Transactions on Computers.

[341] V. Gerald Grafe,et al. The Epsilon-2 hybrid dataflow architecture , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.

[342] Jim Gray,et al. Benchmark Handbook: For Database and Transaction Processing Systems , 1992 .

[343] Thorsten von Eicken,et al. Low-Latency Communication Over ATM Networks Using Active Messages , 1995, IEEE Micro.

[344] Kunle Olukotun,et al. The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[345] Miron Livny,et al. Condor-a hunter of idle workstations , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[346] Ahmed Sameh,et al. The Illiac IV system , 1972 .

[347] Ian Watson,et al. The Manchester prototype dataflow computer , 1985, CACM.

[348] Sarita V. Adve,et al. Designing memory consistency models for shared-memory multiprocessors , 1993 .

[349] Jaswinder Pal Singh,et al. Hierarchical n-body methods and their implications for multiprocessors , 1993 .

[350] Daniel H. Linder,et al. An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-Ary n-Cubes , 1994, IEEE Trans. Computers.

[351] Andrew A. Chien,et al. Planar-adaptive routing: low-cost adaptive networks for multiprocessors , 1992, ISCA '92.

[352] Josep Torrellas,et al. Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching , 1995, ISCA.

[353] D. Burger,et al. Efficient Synchronization: Let Them Eat QOLB , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[354] M. S. Warren,et al. A parallel hashed Oct-Tree N-body algorithm , 1993, Supercomputing '93.

[355] Mark Horowitz,et al. An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[356] Anoop Gupta,et al. The Stanford Dash multiprocessor , 1992, Computer.

[357] Anoop Gupta,et al. Performance evaluation of memory consistency models for shared-memory multiprocessors , 1991, ASPLOS IV.

[358] Martin Walker,et al. A Shared Memory MPP from Cray Research , 1994, Digit. Tech. J.

[359] Michel Dubois,et al. Implementation and evaluation of update-based cache protocols under relaxed memory consistency models , 1995, Future Gener. Comput. Syst.

[360] Katherine A. Yelick,et al. Optimizing Parallel SPMD Programs , 1994, LCPC.

[361] David B. Gustavson. The Scalable Coherent Interface and related standards projects , 1992, IEEE Micro.

[362] Liviu Iftode,et al. Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems , 1996, OSDI '96.

[363] Gregory G. Finn,et al. ATOMIC: A Low-Cost, Very-High-Speed, Local Communication Architecture , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[364] Rudolf Eigenmann,et al. Benchmarking with real industrial applications: the SPEC High-Performance Group , 1996 .

[365] Anna R. Karlin,et al. Competitive snoopy caching , 1986, 27th Annual Symposium on Foundations of Computer Science (sfcs 1986).

[366] Larry Rudolph,et al. Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors , 1983, TOPLAS.

[367] Gregory G. Finn,et al. Atomic: A High-Speed Local Communication Architecture , 1994, J. High Speed Networks.

[368] Y. Tamir,et al. High-performance multi-queue buffers for VLSI communications switches , 1988, ISCA '88.

[369] Kourosh Gharachorloo,et al. Memory consistency models for shared-memory multiprocessors , 1995 .

[370] Duncan G. Elliott,et al. Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP , 1992, 1992 Proceedings of the IEEE Custom Integrated Circuits Conference.

[371] Janak H. Patel,et al. A low-overhead coherence solution for multiprocessors with private cache memories , 1998, ISCA '98.

[372] David M. Fenwick,et al. The AlphaServer 8000 Series: High-end Server Platform Development , 1995, Digit. Tech. J.