Parallel computer architecture - a hardware / software approach
[1] Eric A. Brewer,et al. Remote queues: exposing message queues for optimization and atomicity , 1995, SPAA '95.
[2] Wen-Hann Wang,et al. On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.
[3] Michael Stumm,et al. Cache consistency in hierarchical-ring-based multiprocessors , 1992, Proceedings Supercomputing '92.
[4] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.
[5] Michael L. Scott,et al. Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.
[6] Calvin K. Tang. Cache system design in the tightly coupled multiprocessor system , 1976, AFIPS '76.
[7] Marc Snir,et al. The Performance of Multistage Interconnection Networks for Multiprocessors , 1983, IEEE Transactions on Computers.
[8] James R. Goodman. Using cache memory to reduce processor-memory traffic , 1998, ISCA '98.
[9] R. S. Nikhil. Can dataflow subsume von Neumann computing? , 1989, ISCA '89.
[10] Anant Agarwal,et al. APRIL: a processor architecture for multiprocessing , 1990, ISCA '90.
[11] William J. Dally,et al. The J-machine network , 1992, Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors.
[12] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.
[13] Vaidy S. Sunderam,et al. PVM: A Framework for Parallel Distributed Computing , 1990, Concurr. Pract. Exp..
[14] Josep Torrellas,et al. False Sharing and Spatial Locality in Multiprocessor Caches , 1994, IEEE Trans. Computers.
[15] Alan L. Cox,et al. Lazy release consistency for software distributed shared memory , 1992, ISCA '92.
[16] Charles E. Leiserson,et al. Randomized Routing on Fat-Trees , 1989, Adv. Comput. Res..
[17] Jean-Loup Baer,et al. A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.
[18] Jonathan S. Turner,et al. Design of a broadcast packet switching network , 1988, IEEE Trans. Commun..
[19] Anant Agarwal,et al. Anatomy of a message in the Alewife multiprocessor , 1993 .
[20] F. Leighton,et al. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes , 1991 .
[21] Richard P. Martin,et al. HPAM: an active message layer for a network of HP workstations , 1994, Symposium Record Hot Interconnects II.
[22] V. Benes,et al. Mathematical Theory of Connecting Networks and Telephone Traffic. , 1966 .
[23] William J. Dally,et al. The message-driven processor: a multicomputer processing node with efficient mechanisms , 1992, IEEE Micro.
[24] Peter M. Kogge,et al. EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.
[25] Michael J. Flynn,et al. Latency Tolerance for Dynamic Processors , 1996 .
[26] John S. Keen,et al. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks , 1997, ACM/IEEE SC 1997 Conference (SC'97).
[27] Jaswinder Pal Singh,et al. A methodology and an evaluation of the SGI Origin2000 , 1998, SIGMETRICS '98/PERFORMANCE '98.
[28] Luiz André Barroso,et al. The performance of cache-coherent ring-based multiprocessors , 1993, ISCA '93.
[29] Alan L. Cox,et al. Evaluation of release consistent software distributed shared memory on emerging network technology , 1993, ISCA '93.
[30] Kenneth E. Batcher. STARAN parallel processor system hardware , 1974, AFIPS '74.
[31] Jack B. Dennis,et al. Data Flow Supercomputers , 1980, Computer.
[32] Monica S. Lam,et al. Jade: a high-level, machine-independent language for parallel programming , 1993, Computer.
[33] Michael L. Scott,et al. Using memory-mapped network interfaces to improve the performance of distributed shared memory , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.
[34] Dirk Roose,et al. Benchmarking the iPSC/2 Hypercube Multiprocessor , 1989, Concurr. Pract. Exp..
[35] Gene H. Golub,et al. Matrix computations (3rd ed.) , 1996 .
[36] Dean M. Tullsen,et al. Limitations of cache prefetching on a bus-based multiprocessor , 1993, ISCA '93.
[37] Report,et al. Public International Benchmarks for Parallel Computers , 1993 .
[38] Richard J. Swan,et al. The implementation of the Cm* multi-microprocessor , 1977, AFIPS '77.
[39] James Cownie,et al. Message Passing on the Meiko CS-2 , 1994, Parallel Comput..
[40] K. Mani Chandy,et al. Parallel program design - a foundation , 1988 .
[41] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[42] Moriyoshi Ohara,et al. Producer-oriented versus consumer-oriented prefetching: a comparison and analysis of parallel application programs , 1996 .
[43] Anant Agarwal,et al. Limits on Interconnection Network Performance , 1991, IEEE Trans. Parallel Distributed Syst..
[44] Stephen R. Goldschmidt,et al. Simulation of multiprocessors: accuracy and performance , 1993 .
[45] Dean M. Tullsen,et al. Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[46] D. Lenoski,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[47] Henry Fuchs,et al. Near real-time shaded display of rigid objects , 1983, SIGGRAPH.
[48] Sarita V. Adve,et al. An evaluation of memory consistency models for shared-memory systems with ILP processors , 1996, ASPLOS VII.
[49] Alan Jay Smith,et al. Analysis of benchmark characteristics and benchmark performance prediction , 1996, TOCS.
[50] S.K. Reinhardt,et al. Decoupled Hardware Support for Distributed Shared Memory , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[51] David E. Culler,et al. Active Message Applications Programming Interface , 1996 .
[52] David A. Wood,et al. An in-cache address translation mechanism , 1986, ISCA '86.
[53] Anoop Gupta,et al. The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.
[54] C. G. Bell. Multis: A New Class of Multiprocessor Computers , 1985, Science.
[55] S. Konstantinidou,et al. Chaos router: architecture and performance , 1991, [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture.
[56] John B. Carter,et al. An argument for simple COMA , 1995, Future Gener. Comput. Syst..
[57] Susan J. Eggers,et al. Eliminating False Sharing , 1991, ICPP.
[58] David P. Rodgers,et al. Improvements in multiprocessor system design , 1985, ISCA '85.
[59] Burton J. Smith,et al. The Horizon supercomputing system: architecture and software , 1988, Proceedings. SUPERCOMPUTING '88.
[60] Fredrik Dahlgren. Boosting the performance of hybrid snooping cache protocols , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[61] Monica S. Lam,et al. The design and evaluation of a shared object system for distributed memory machines , 1994, OSDI '94.
[62] John L. Hennessy,et al. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors , 1995 .
[63] William J. Dally,et al. The Named-State Register File: implementation and performance , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.
[64] E. Biagioni,et al. Designing a practical ATM LAN , 1993, IEEE Network.
[65] Kai Li,et al. Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.
[66] Stefanos Kaxiras,et al. Kiloprocessor Extensions to SCI , 1996, Proceedings of International Conference on Parallel Processing.
[67] Charles E. Leiserson,et al. Fat-trees: Universal networks for hardware-efficient supercomputing , 1985, IEEE Transactions on Computers.
[68] Jean-Loup Baer,et al. An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[69] Truman Joe. COMA-F: a non-hierarchical cache only memory architecture , 1995 .
[70] Harold S. Stone,et al. A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.
[71] John L. Hennessy,et al. Evaluating the memory overhead required for COMA architectures , 1994, ISCA '94.
[72] Peter J. Denning,et al. The working set model for program behavior , 1968, CACM.
[73] Michael D. Noakes,et al. The J-machine multicomputer: an architectural evaluation , 1993, ISCA '93.
[74] Anoop Gupta,et al. Parallel Visualization Algorithms , 1994 .
[75] Yoichi Koyanagi,et al. AP1000+: architectural support of PUT/GET interface for parallelizing compiler , 1994, ASPLOS VI.
[76] Anoop Gupta,et al. Two Techniques to Enhance the Performance of Memory Consistency Models , 1991, ICPP.
[77] R. E. Kessler,et al. Cray T3D: a new dimension for Cray Research , 1993, Digest of Papers. Compcon Spring.
[78] Robert J. Fowler,et al. Adaptive cache coherency for detecting migratory shared data , 1993, ISCA '93.
[79] Christopher C. Hsiung,et al. Cray X-MP: the birth of a supercomputer , 1989, Computer.
[80] Nancy P. Kronenberg,et al. VAXcluster: a closely-coupled distributed system , 1986, TOCS.
[81] Eric A. Brewer,et al. Scalable expanders: exploiting hierarchical random wiring , 1994, STOC '94.
[82] Michel Dubois,et al. Correct memory operation of cache-based multiprocessors , 1987, ISCA '87.
[83] Nitin D. Godiwala,et al. The Second-generation Processor Module for AlphaServer 2100 Systems , 1995, Digit. Tech. J..
[84] J. Y. Ngai,et al. A framework for adaptive routing in multicomputer networks , 1989, CARN.
[85] David J. Schanin. The design and development of a very high speed system bus - the Encore Multimax Nanobus , 1986 .
[86] Alan Jay Smith,et al. A class of compatible cache consistency protocols and their support by the IEEE futurebus , 1986, ISCA '86.
[87] Allan Porterfield,et al. The Tera computer system , 1990 .
[88] H. B. Bakoglu,et al. Circuits, interconnections, and packaging for VLSI , 1990 .
[89] Anoop Gupta,et al. Scaling parallel programs for multiprocessors: methodology and examples , 1993, Computer.
[90] Greg J. Regnier,et al. The Virtual Interface Architecture , 2002, IEEE Micro.
[91] Richard Kaufmann,et al. Using the Memory Channel Network , 1997, IEEE Micro.
[92] R. Gillett,et al. Overview of memory channel network for PCI , 1996, COMPCON '96. Technologies for the Information Superhighway Digest of Papers.
[93] C. A. R. Hoare,et al. Communicating Sequential Processes (Reprint) , 1983, Commun. ACM.
[94] Loren Schwiebert,et al. A universal proof technique for deadlock-free routing in interconnection networks , 1995, SPAA '95.
[95] Alan L. Cox,et al. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.
[96] Samuel P. Morgan,et al. Input Versus Output Queueing on a Space-Division Packet Switch , 1987, IEEE Trans. Commun..
[97] Andrew W. Wilson,et al. Hierarchical cache/bus architecture for shared memory multiprocessors , 1987, ISCA '87.
[98] W. Daniel Hillis,et al. The connection machine , 1985 .
[99] Michael D. Smith,et al. Limits on multiple instruction issue , 1989, ASPLOS 1989.
[100] Robert J. Harrison,et al. Performance and experience with LAPI-a new high-performance communication library for the IBM RS/6000 SP , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.
[101] Mark D. Hill,et al. A Unified Formalization of Four Shared-Memory Models , 1993, IEEE Trans. Parallel Distributed Syst..
[102] H. T. Kung,et al. The design of nectar: a network backplane for heterogeneous multicomputers , 1989, ASPLOS 1989.
[103] Guy E. Blelloch,et al. A comparison of sorting algorithms for the connection machine CM-2 , 1991, SPAA '91.
[104] Maurice Herlihy,et al. Impossibility and universality results for wait-free synchronization , 1988, PODC '88.
[105] L. Hernquist,et al. Performance characteristics of tree codes , 1987 .
[106] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..
[107] Joshua E. Barnes,et al. Error Analysis of a Tree Code , 1989 .
[108] Quinn Snell,et al. HINT: A new way to measure computer performance , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.
[109] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.
[110] Kai Li,et al. Two virtual memory mapped network interface designs , 1994, Symposium Record Hot Interconnects II.
[111] Anoop Gupta,et al. Hiding memory latency using dynamic scheduling in shared-memory multiprocessors , 1992, ISCA '92.
[112] Daniel L. Slotnick,et al. The SOLOMON computer , 1962, AFIPS '62 (Fall).
[113] Shreekant S. Thakkar,et al. Synchronization algorithms for shared-memory multiprocessors , 1990, Computer.
[114] Edsger W. Dijkstra,et al. Solution of a problem in concurrent programming control , 1965, CACM.
[115] Beng-Hong Lim,et al. Reactive synchronization algorithms for multiprocessors , 1994, ASPLOS VI.
[116] Stefanos Kaxiras,et al. The GLOW cache coherence protocol extensions for widely shared data , 1996, ICS '96.
[117] William Gropp,et al. Using MPI: portable parallel programming with the message-passing interface , 1994 .
[118] Mike Johnson,et al. Superscalar microprocessor design , 1991, Prentice Hall series in innovative technology.
[119] Kunle Olukotun,et al. The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation , 1995, Proceedings of the IEEE/ACM SC95 Conference.
[120] Anoop Gupta,et al. Programming for Different Memory Consistency Models , 1992, J. Parallel Distributed Comput..
[121] Michel Cekleov,et al. Formal Specification of Memory Models , 1992 .
[122] Paul Feautrier,et al. A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.
[123] G. C. Fox,et al. Solving Problems on Concurrent Processors , 1988 .
[124] Anoop Gupta,et al. Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.
[125] J. L. Hennessy,et al. An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors , 1993, Supercomputing '93.
[126] Seth Copen Goldstein,et al. Evaluation of mechanisms for fine-grained parallel programs in the J-machine and the CM-5 , 1993, ISCA '93.
[127] Remzi H. Arpaci-Dusseau,et al. Empirical evaluation of the CRAY-T3D: a compiler perspective , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[128] Sarita V. Adve,et al. An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.
[129] James R. Larus,et al. Fine-grain access control for distributed shared memory , 1994, ASPLOS VI.
[130] Norman P. Jouppi,et al. Available instruction-level parallelism for superscalar and superpipelined machines , 1989, ASPLOS 1989.
[131] Jonathan M. Smith,et al. A high-performance host interface for ATM networks , 1991, SIGCOMM 1991.
[132] William A. Wulf,et al. Overview of the Hydra Operating System development , 1975, SOSP.
[133] Faye A. Briggs,et al. The floating point performance of a superscalar SPARC processor , 1991, ASPLOS IV.
[134] D.A. Wood,et al. Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[135] David E. Culler,et al. A case for NOW (networks of workstation) , 1995, PODC '95.
[136] Anoop Gupta,et al. The Stanford FLASH Multiprocessor , 1994, ISCA.
[137] Richard B. Gillett. Memory Channel Network for PCI , 1996, IEEE Micro.
[138] Paul Hudak,et al. Memory coherence in shared virtual memory systems , 1989, TOCS.
[139] Randy H. Katz,et al. The effect of sharing on the cache and bus performance of parallel programs , 1989, ASPLOS 1989.
[140] Vijay S. Pai,et al. The Interaction Of Software Prefetching With ILP Processors In Shared-memory Systems , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[141] D. Burger,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[142] Michel Dubois,et al. Combined performance gains of simple cache protocol extensions , 1994, ISCA '94.
[143] Scott A. Mahlke,et al. IMPACT: an architectural framework for multiple-instruction-issue processors , 1991, ISCA '91.
[144] John D. Valois. Lock-free linked lists using compare-and-swap , 1995, PODC '95.
[145] Charles E. Leiserson,et al. How to assemble tree machines (Extended Abstract) , 1982, STOC '82.
[146] Anoop Gupta,et al. The DASH prototype: implementation and performance , 1992, ISCA '92.
[147] R. M. Tomasulo,et al. An efficient algorithm for exploiting multiple arithmetic units , 1995 .
[148] T. Lovett,et al. STiNG: A CC-NUMA Computer System for the Commercial Marketplace , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[149] Liviu Iftode,et al. Scope consistency: a bridge between release consistency and entry consistency , 1996, SPAA '96.
[150] Michael Shebanow,et al. Single instruction stream parallelism is greater than two , 1991, ISCA '91.
[151] Anoop Gupta,et al. SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.
[152] Maurice Herlihy,et al. Axioms for concurrent objects , 1987, POPL '87.
[153] Calton Pu,et al. A Lock-Free Multiprocessor OS Kernel , 1992, OPSR.
[154] Donald Yeung,et al. The MIT Alewife machine: architecture and performance , 1995, ISCA '95.
[155] John L. Hennessy,et al. SoftFLASH: analyzing the performance of clustered distributed virtual shared memory , 1996, ASPLOS VII.
[156] Richard M. Russell,et al. The CRAY-1 computer system , 1978, CACM.
[157] Charles R. Vick,et al. PEPE architecture - present and future , 1978, AFIPS National Computer Conference.
[158] T. A. Jeeves,et al. On the use of the SOLOMON parallel-processing computer , 1962, AFIPS '62 (Fall).
[159] Robert W. Horst. TNet: A Reliable System Area Network , 1995, IEEE Micro.
[160] James R. Larus,et al. Tempest and typhoon: user-level shared memory , 1994, ISCA '94.
[161] S. F. Reddaway. DAP—a distributed array processor , 1973, ISCA 1973.
[162] Katherine A. Yelick,et al. Analyses and Optimizations for Shared Address Space Programs , 1996, J. Parallel Distributed Comput..
[163] Chris J. Scheiman,et al. Experience with active messages on the Meiko CS-2 , 1995, Proceedings of 9th International Parallel Processing Symposium.
[164] Michel Dubois,et al. Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..
[165] Donald E. Knuth,et al. Additional comments on a problem in concurrent programming control , 1966, CACM.
[166] Anant Agarwal,et al. LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.
[167] Maurice Herlihy,et al. A methodology for implementing highly concurrent data objects , 1993, TOPL.
[168] Jack J. Dongarra,et al. Performance of various computers using standard linear equations software in a FORTRAN environment , 1988, CARN.
[169] James R. Larus,et al. Mechanisms for cooperative shared memory , 1993, ISCA '93.
[170] Charles L. Seitz,et al. Concurrent VLSI Architectures , 1984, IEEE Transactions on Computers.
[171] David H. Bailey,et al. FFTs in external or hierarchical memory , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[172] Anoop Gupta,et al. Integration of message passing and shared memory in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.
[173] Shuichi Sakai,et al. Prototype implementation of a highly parallel dataflow machine EM-4 , 1991, [1991] Proceedings. The Fifth International Parallel Processing Symposium.
[174] Geoffrey C. Fox,et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers , 1989, Int. J. High Perform. Comput. Appl..
[175] Andris Padegs. System/360 and Beyond , 1981, IBM J. Res. Dev..
[176] Janak H. Patel,et al. Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.
[177] Kourosh Gharachorloo,et al. Shasta: a low overhead, software-only approach for supporting fine-grain shared memory , 1996, ASPLOS VII.
[178] Charles L. Seitz,et al. Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.
[179] Brian N. Bershad,et al. Software write detection for a distributed shared memory , 1994, OSDI '94.
[180] Andrea C. Arpaci-Dusseau,et al. Fast Parallel Sorting Under LogP: Experience with the CM-5 , 1996, IEEE Trans. Parallel Distributed Syst..
[181] Janak H. Patel,et al. Stride directed prefetching in scalar processors , 1992, MICRO 1992.
[182] Robert W. Horst,et al. An architecture for high volume transaction processing , 1985, ISCA '85.
[183] Kenichi Hayashi,et al. Improving AP1000 parallel computer performance with message communication , 1993, ISCA '93.
[184] Alan Jay Smith,et al. Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.
[185] Anthony J. G. Hey,et al. The Genesis distributed memory benchmarks , 1991, Parallel Comput..
[186] Mark D. Hill,et al. Weak ordering—a new definition , 1998, ISCA '98.
[187] J. E. Thornton,et al. Parallel operation in the control data 6600 , 1964, AFIPS '64 (Fall, part II).
[188] Y. Fujita,et al. A 7.68 GIPS 3.84 GB/s 1W parallel image processing RAM integrating a 16 Mb DRAM and 128 processors , 1996, 1996 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, ISSCC.
[189] Anoop Gupta,et al. Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity , 1995, J. Parallel Distributed Comput..
[190] Jack J. Dongarra,et al. Software Libraries for Linear Algebra Computations on High Performance Computers , 1995, SIAM Rev..
[191] James R. Goodman,et al. Performance of Pruning-Cache Directories for Large-Scale Multiprocessors , 1993, IEEE Trans. Parallel Distributed Syst..
[192] M. J. Carlton,et al. Micro benchmark analysis of the KSR1 , 1993, Supercomputing '93.
[193] Srinivasan Parthasarathy,et al. Cashmere-2L: software coherent shared memory on a clustered remote-write network , 1997, SOSP.
[194] Liviu Iftode,et al. Evaluation of hardware write propagation support for next-generation shared virtual memory clusters , 1998, ICS '98.
[195] Christos H. Papadimitriou,et al. The serializability of concurrent database updates , 1979, JACM.
[196] Lionel M. Ni,et al. The turn model for adaptive routing , 1992, ISCA '92.
[197] Willy Zwaenepoel,et al. Implementation and performance of Munin , 1991, SOSP '91.
[198] Manoj Kumar,et al. Unique design concepts in GF11 and their impact on performance , 1992, IBM J. Res. Dev..
[199] Richard M. Karp,et al. An optimal algorithm for on-line bipartite matching , 1990, STOC '90.
[200] K. Olukotun,et al. Evaluation of Design Alternatives for a Multiprocessor Microprocessor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[201] J. M. Barton,et al. Translation Lookaside Buffer Synchronization in a Multiprocessor System , 1988, USENIX Winter.
[202] Jean-Loup Baer,et al. Reducing memory latency via non-blocking and prefetching caches , 1992, ASPLOS V.
[203] Michel Cekleov,et al. XDBus: a high-performance, consistent, packet-switched VLSI bus , 1993, Digest of Papers. Compcon Spring.
[204] Monica S. Lam,et al. Limits of control flow on parallelism , 1992, ISCA '92.
[205] A. Malony,et al. Implementing a parallel C++ runtime system for scalable parallel systems , 1993, Supercomputing '93.
[206] Jack Dongarra,et al. Computer benchmarking: paths and pitfalls , 1987 .
[207] Arvind,et al. *T: a multithreaded massively parallel architecture , 1992, ISCA '92.
[208] James R. Larus,et al. Application-specific protocols for user-level shared memory , 1994, Proceedings of Supercomputing '94.
[209] W. H. Wang,et al. Organization and performance of a two-level virtual-real cache hierarchy , 1989, ISCA '89.
[210] Willy Zwaenepoel,et al. Techniques for reducing consistency-related communication in distributed shared-memory systems , 1995, TOCS.
[211] James K. Archibald,et al. Cache coherence protocols: evaluation using a multiprocessor simulation model , 1986, TOCS.
[212] Steven Fortune,et al. Parallelism in random access machines , 1978, STOC.
[213] Robert W. Horst,et al. A flexible ServerNet-based fault-tolerant architecture , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[214] Al Geist,et al. Network-based concurrent computing on the PVM system , 1992, Concurr. Pract. Exp..
[215] Liviu Iftode,et al. Software support for virtual memory-mapped communication , 1996, Proceedings of International Conference on Parallel Processing.
[216] Burton J. Smith. Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.
[217] Michel Dubois,et al. Delayed consistency and its effects on the miss rate of parallel programs , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[218] Anoop Gupta,et al. Memory-reference characteristics of multiprocessor applications under MACH , 1988, SIGMETRICS 1988.
[219] Håkan Grahn,et al. Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection , 1996, J. Parallel Distributed Comput..
[220] Michael J. Flynn,et al. Reducing Cache Miss Rates Using Prediction Caches , 1996 .
[221] Michael Burrows,et al. Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links , 1991, IEEE J. Sel. Areas Commun..
[222] Jack L. Lo,et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[223] David A. Patterson,et al. Computer architecture (2nd ed.): a quantitative approach , 1996 .
[224] William J. Dally,et al. Performance Analysis of k-Ary n-Cube Interconnection Networks , 1987, IEEE Trans. Computers.
[225] Robert W. Horst,et al. Multiple instruction issue in the NonStop cyclone processor , 1990, ISCA '90.
[226] Maya Gokhale,et al. Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.
[227] Vipin Kumar,et al. Analysis of scalability of parallel algorithms and architectures: a survey , 1991, ICS '91.
[228] R. H. Katz,et al. Evaluating the performance of four snooping cache coherency protocols , 1989, ISCA '89.
[229] David E. Culler,et al. Monsoon: an explicit token-store architecture , 1998, ISCA '98.
[230] Lawrence C. Stewart,et al. Firefly: a multiprocessor workstation , 1987, ASPLOS 1987.
[231] Maurice Herlihy,et al. Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.
[232] Anoop Gupta,et al. Comparative performance evaluation of cache-coherent NUMA and COMA architectures , 1992, ISCA '92.
[233] David H. Bailey. Misleading Performance Reporting in the Supercomputing Field , 1992, Sci. Program..
[234] Samuel H. Fuller,et al. Cm*: a modular, multi-microprocessor , 1977, AFIPS '77.
[235] Michel Dubois,et al. Memory Access Dependencies in Shared-Memory Multiprocessors , 1990, IEEE Trans. Software Eng..
[236] Charles L. Seitz,et al. The cosmic cube , 1985, CACM.
[237] P. Pierce,et al. The Paragon implementation of the NX message passing interface , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.
[238] William J. Dally. Virtual-channel flow control , 1990, ISCA '90.
[239] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[240] David E. Culler,et al. Virtual network transport protocols for Myrinet , 1998, IEEE Micro.
[241] David E. Culler,et al. Analysis of multithreaded architectures for parallel computing , 1990, SPAA '90.
[242] Elliot Nestle,et al. The SYNAPSE N+1 System: architectural characteristics and performance data of a tightly-coupled multiprocessor system , 1985, ISCA '85.
[243] George Karypis,et al. Introduction to Parallel Computing , 1994 .
[244] Corinna Lee. Multistep Gradual Rounding , 1989, IEEE Trans. Computers.
[245] James R. Larus,et al. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.
[246] Fong Pong,et al. Missing the Memory Wall: The Case for Processor/Memory Integration , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[247] Maurice Herlihy,et al. Wait-free synchronization , 1991, TOPL.
[248] David L. Black,et al. Translation lookaside buffer consistency: a software approach , 1989, ASPLOS 1989.
[249] Kenji Nishida,et al. An Architecture of a Data Flow Machine and Its Evaluation , 1984, COMPCON.
[250] Allan Gottlieb,et al. Highly parallel computing , 1989, Benjamin/Cummings Series in computer science and engineering.
[251] Mark Horowitz,et al. Performance tradeoffs in cache design , 1988, ISCA '88.
[252] Michael Stumm,et al. Hector: a hierarchically structured shared-memory multiprocessor , 1991, Computer.
[253] Guy L. Steele,et al. The High Performance Fortran Handbook , 1993 .
[254] Seth Copen Goldstein,et al. Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.
[255] Richard L. Sites,et al. Alpha Architecture Reference Manual , 1995 .
[256] Ashok Singhal,et al. The next-generation SPARC multiprocessing system architecture , 1993, Digest of Papers. Compcon Spring.
[257] Thomas E. Anderson,et al. High speed switch scheduling for local area networks , 1992, ASPLOS V.
[258] Bryan S. Rosenburg. Low-synchronization translation lookaside buffer consistency in large-scale shared-memory multiprocessors , 1989, SOSP '89.
[259] Jack J. Dongarra,et al. The PVM Concurrent Computing System: Evolution, Experiences, and Trends , 1994, Parallel Comput..
[260] Anoop Gupta,et al. The DASH Prototype: Logic Overhead and Performance , 1993, IEEE Trans. Parallel Distributed Syst..
[261] Todd C. Mowry,et al. Tolerating latency through software-controlled data prefetching , 1994 .
[262] W. Daniel Hillis,et al. Data parallel algorithms , 1986, CACM.
[263] Jack Dongarra,et al. MPI: The Complete Reference , 1996 .
[264] David Banks,et al. A High-Performance Network Architecture for a PA-RISC Workstation , 1993, IEEE J. Sel. Areas Commun..
[265] James R. Goodman,et al. The Impact of Pipelined Channels on k-ary n-Cube Networks , 1994, IEEE Trans. Parallel Distributed Syst..
[266] A. Richard Newton,et al. An empirical evaluation of two memory-efficient directory methods , 1990, ISCA '90.
[267] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.
[268] Katherine A. Yelick,et al. Optimizing parallel programs with explicit synchronization , 1995, PLDI '95.
[269] Thomas J. LeBlanc,et al. Adjustable block size coherent caches , 1992, ISCA '92.
[270] James R. Goodman,et al. Cache Consistency and Sequential Consistency , 1991 .
[271] Alan Jay Smith,et al. Cache Memories , 1982, CSUR.
[272] Anoop Gupta,et al. Complete computer system simulation: the SimOS approach , 1995, IEEE Parallel Distributed Technol. Syst. Appl..
[273] Jim Savage,et al. Parallel processing as a language design problem , 1985, ISCA '85.
[274] Charles L. Seitz,et al. Multicomputers: message-passing concurrent computers , 1988, Computer.
[275] P. R. Cappello,et al. Implementing the Beam and Warming method on the hypercube , 1989, C3P.
[276] Anoop Gupta,et al. Working sets, cache sizes, and node granularity issues for large-scale multiprocessors , 1993, ISCA '93.
[277] Livio Ricciulli,et al. The detection and elimination of useless misses in multiprocessors , 1993, ISCA '93.
[278] Yale Patt,et al. Exploiting fine-grained parallelism through a combination of hardware and software techniques , 1991, ISCA '91.
[279] Michael S. Warren,et al. Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems , 1994, Int. J. High Perform. Comput. Appl..
[280] S.-Y.R. Li. Theory of periodic contention and its application to packet switching , 1988, IEEE INFOCOM '88, Seventh Annual Joint Conference of the IEEE Computer and Communications Societies. Networks: Evolution or Revolution?.
[281] Edsger W. Dijkstra,et al. Termination Detection for Diffusing Computations , 1980, Inf. Process. Lett..
[282] Kai Li,et al. Understanding Application Performance on Shared Virtual Memory Systems , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[283] Michel Dubois,et al. Memory access buffering in multiprocessors , 1986, ISCA '86.
[284] S. G. Tucker,et al. The IBM 3090 System: An Overview , 1986, IBM Syst. J..
[285] B. Delagi,et al. Distributed-directory scheme: Stanford distributed-directory protocol , 1990, Computer.
[286] Eric A. Brewer,et al. How to get good performance from the CM-5 data network , 1994, Proceedings of 8th International Parallel Processing Symposium.
[287] Eric Williams,et al. Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.
[288] K. Gunther,et al. Prevention of Deadlocks in Packet-Switched Data Transport Systems , 1981 .
[289] Liviu Iftode,et al. Improving release-consistent shared virtual memory using automatic update , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.
[290] David E. Culler,et al. Two Fundamental Limits on Dataflow Multiprocessing , 1993, Architectures and Compilation Techniques for Fine and Medium Grain Parallelism.
[291] H. T. Kung,et al. Supporting systolic and memory communication in iWarp , 1990, ISCA '90.
[292] David E. Culler,et al. Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine , 1991, ASPLOS IV.
[293] James H. Patterson,et al. Portable Programs for Parallel Processors , 1987 .
[294] Ronald Minnich,et al. The memory-integrated network interface , 1995, IEEE Micro.
[295] Allan Gottlieb,et al. Complexity Results for Permuting Data and Other Computations on Parallel Processors , 1984, JACM.
[296] Patricia J. Teller. Translation-lookaside buffer consistency , 1990, Computer.
[297] Brian N. Bershad,et al. The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.
[298] Andrew Wilson,et al. Shared memory multiprocessors: the right approach to parallel processing , 1989, Digest of Papers. COMPCON Spring 89. Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage.
[299] William J. Dally,et al. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks , 1987, IEEE Transactions on Computers.
[300] Steven L. Scott,et al. Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.
[301] Maged M. Michael,et al. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms , 1996, PODC '96.
[302] W. Daniel Hillis,et al. The CM-5 Connection Machine: a scalable supercomputer , 1993, CACM.
[303] Anoop Gupta,et al. Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.
[304] Dean M. Tullsen,et al. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading , 1997, TOCS.
[305] Sarita V. Adve,et al. Shared Memory Consistency Models: A Tutorial , 1996, Computer.
[306] Gregory F. Pfister,et al. “Hot spot” contention and combining in multistage interconnection networks , 1985, IEEE Transactions on Computers.
[307] T. von Eicken,et al. Parallel programming in Split-C , 1993, Supercomputing '93.
[308] Monica S. Lam,et al. Global optimizations for parallelism and locality on scalable parallel machines , 1993, PLDI '93.
[309] John K. Salmon,et al. Parallel hierarchical N-body methods , 1992 .
[310] Mary K. Vernon,et al. Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS III.
[311] F. Baskett,et al. The 4D-MP graphics superworkstation: computing+graphics=40 MIPS+MFLOPS and 100000 lighted polygons per second , 1988, Digest of Papers. COMPCON Spring 88 Thirty-Third IEEE Computer Society International Conference.
[312] Anoop Gupta,et al. Modeling communication in parallel algorithms: a fruitful interaction between theory and systems? , 1994, SPAA '94.
[313] Dennis Shasha,et al. Efficient and correct execution of parallel programs that share memory , 1988, TOPL.
[314] Mark D. Hill,et al. Implementing Sequential Consistency in Cache-Based Systems , 1990, ICPP.
[315] Todd C. Mowry,et al. Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.
[316] MPI Forum. MPI: A Message-Passing Interface , 1994 .
[317] James L. Flanagan,et al. Technologies for multimedia communications , 1994, Proc. IEEE.
[318] Mosur Ravishankar,et al. PLUS: a distributed shared-memory system , 1990, ISCA '90.
[319] Jaswinder Pal Singh,et al. Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors , 1997, PPOPP '97.
[320] Michael J. Flynn,et al. Some Computer Organizations and Their Effectiveness , 1972, IEEE Transactions on Computers.
[321] Alexander Aiken,et al. Optimal loop parallelization , 1988, PLDI '88.
[322] David R. Cheriton,et al. The synergy between non-blocking synchronization and operating system structure , 1996, OSDI '96.
[323] Pat Hanrahan,et al. A rapid hierarchical radiosity algorithm , 1991, SIGGRAPH.
[324] James P. Anderson,et al. D825 - a multiple-computer system for command & control , 1962, AFIPS '62 (Fall).
[325] Masahiro Yoshida,et al. Development and achievement of NAL Numerical Wind Tunnel (NWT) for CFD computations , 1994, Proceedings of Supercomputing '94.
[326] Seth Copen Goldstein,et al. NIFDY: a low overhead, high throughput network interface , 1995, ISCA.
[327] James P. Laudon,et al. Architectural and Implementation Tradeoffs for Multiple-Context Processors , 1995 .
[328] Christopher F. Joerg,et al. The Monsoon interconnection network , 1991, [1991 Proceedings] IEEE International Conference on Computer Design: VLSI in Computers and Processors.
[329] Leonard Kleinrock,et al. Virtual Cut-Through: A New Computer Communication Switching Technique , 1979, Comput. Networks.
[330] Bruce S. Davie,et al. Computer Networks: A Systems Approach , 1996 .
[331] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[332] Jack Dongarra,et al. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers , 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.
[333] Anoop Gupta,et al. Comparative evaluation of latency reducing and tolerating techniques , 1991, ISCA '91.
[334] James V. Lawton,et al. Building a High-performance Message-passing System for MEMORY CHANNEL Clusters , 1996, Digit. Tech. J..
[335] Randall Rettberg,et al. Contention is no obstacle to shared-memory multiprocessing , 1986, CACM.
[336] Peter S. Pacheco. Parallel programming with MPI , 1996 .
[337] Anoop Gupta,et al. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.
[338] John R. Nickolls,et al. The design of the MasPar MP-1: a cost effective massively parallel computer , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.
[339] Daniel Shawcross Wilkerson,et al. System area network mapping , 1997, SPAA '97.
[340] Kenneth E. Batcher,et al. Design of a Massively Parallel Processor , 1980, IEEE Transactions on Computers.
[341] V. Gerald Grafe,et al. The Epsilon-2 hybrid dataflow architecture , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.
[342] Jim Gray,et al. Benchmark Handbook: For Database and Transaction Processing Systems , 1992 .
[343] Thorsten von Eicken,et al. Low-Latency Communication Over ATM Networks Using Active Messages , 1995, IEEE Micro.
[344] Kunle Olukotun,et al. The case for a single-chip multiprocessor , 1996, ASPLOS VII.
[345] Miron Livny,et al. Condor - a hunter of idle workstations , 1988, [1988] Proceedings. The 8th International Conference on Distributed.
[346] Ahmed Sameh,et al. The Illiac IV system , 1972 .
[347] Ian Watson,et al. The Manchester prototype dataflow computer , 1985, CACM.
[348] Sarita V. Adve,et al. Designing memory consistency models for shared-memory multiprocessors , 1993 .
[349] Jaswinder Pal Singh,et al. Hierarchical n-body methods and their implications for multiprocessors , 1993 .
[350] Daniel H. Linder,et al. An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-Ary n-Cubes , 1994, IEEE Trans. Computers.
[351] Andrew A. Chien,et al. Planar-adaptive routing: low-cost adaptive networks for multiprocessors , 1992, ISCA '92.
[352] Josep Torrellas,et al. Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching , 1995, ISCA.
[353] D. Burger,et al. Efficient Synchronization: Let Them Eat QOLB , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[354] M. S. Warren,et al. A parallel hashed Oct-Tree N-body algorithm , 1993, Supercomputing '93.
[355] Mark Horowitz,et al. An evaluation of directory schemes for cache coherence , 1988, ISCA '88.
[356] Anoop Gupta,et al. The Stanford Dash multiprocessor , 1992, Computer.
[357] Anoop Gupta,et al. Performance evaluation of memory consistency models for shared-memory multiprocessors , 1991, ASPLOS IV.
[358] Martin Walker,et al. A Shared Memory MPP from Cray Research , 1994, Digit. Tech. J..
[359] Michel Dubois,et al. Implementation and evaluation of update-based cache protocols under relaxed memory consistency models , 1995, Future Gener. Comput. Syst..
[360] Katherine A. Yelick,et al. Optimizing Parallel SPMD Programs , 1994, LCPC.
[361] David B. Gustavson. The Scalable Coherent Interface and related standards projects , 1992, IEEE Micro.
[362] Liviu Iftode,et al. Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems , 1996, OSDI '96.
[363] Gregory G. Finn,et al. ATOMIC: A Low-Cost, Very-High-Speed, Local Communication Architecture , 1993, 1993 International Conference on Parallel Processing - ICPP'93.
[364] Rudolf Eigenmann,et al. Benchmarking with real industrial applications: the SPEC High-Performance Group , 1996 .
[365] Anna R. Karlin,et al. Competitive snoopy caching , 1986, 27th Annual Symposium on Foundations of Computer Science (sfcs 1986).
[366] Larry Rudolph,et al. Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors , 1983, TOPL.
[367] Gregory G. Finn,et al. Atomic: A High-Speed Local Communication Architecture , 1994, J. High Speed Networks.
[368] Y. Tamir,et al. High-performance multi-queue buffers for VLSI communications switches , 1988, ISCA '88.
[369] Kourosh Gharachorloo,et al. Memory consistency models for shared-memory multiprocessors , 1995 .
[370] Duncan G. Elliott,et al. Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP , 1992, 1992 Proceedings of the IEEE Custom Integrated Circuits Conference.
[371] Janak H. Patel,et al. A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.
[372] David M. Fenwick,et al. The AlphaServer 8000 Series: High-end Server Platform Development , 1995, Digit. Tech. J..