Hybrid coherence for scalable multicore architectures
[1] Edward J. McCluskey,et al. PADded cache: a new fault-tolerance technique for cache memories , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).
[2] Samuel Williams,et al. Auto-tuning performance on multicore computers , 2008 .
[3] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[4] Jonathan Chang,et al. A 45 nm 8-Core Enterprise Xeon® Processor , 2010, IEEE J. Solid State Circuits.
[5] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[6] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[7] Norman P. Jouppi,et al. Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[8] Michael L. Scott,et al. Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.
[9] Christopher J. Hughes,et al. Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.
[10] David A. Padua,et al. Hierarchically tiled arrays for parallelism and locality , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[11] James R. Goodman,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[12] Maged M. Michael,et al. Design and performance of directory caches for scalable shared memory multiprocessors , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[13] Justin P. Haldar,et al. Accelerating advanced MRI reconstructions on GPUs , 2008, J. Parallel Distributed Comput..
[14] Randi J. Rost. OpenGL shading language , 2004 .
[15] Paul Feautrier,et al. A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.
[16] Milo M. K. Martin,et al. Token tenure: PATCHing token counting using directory-based cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[17] Steven K. Reinhardt,et al. A fully associative software-managed cache design , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[18] David K. McAllister,et al. Fast matrix multiplies using graphics hardware , 2001, SC.
[19] Christoforos E. Kozyrakis,et al. Comparing memory systems for chip multiprocessors , 2007, ISCA '07.
[20] Krisztián Flautner,et al. Evolution of thread-level parallelism in desktop applications , 2010, ISCA.
[21] Sanjay J. Patel,et al. Tradeoffs in designing accelerator architectures for visual computing , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[22] Michel Dubois,et al. Memory access buffering in multiprocessors , 1998, ISCA '98.
[23] J. A. Hartigan,et al. A k-means clustering algorithm , 1979 .
[24] Gregory Francis Pfister,et al. In search of clusters (2nd ed.) , 1998 .
[25] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[26] Daniel Gajski,et al. CEDAR: a large scale multiprocessor , 1983, CARN.
[27] Eric A. Brewer,et al. How to get good performance from the CM-5 data network , 1994, Proceedings of 8th International Parallel Processing Symposium.
[28] Rida A. Bazzi,et al. The power of processor consistency , 1993, SPAA '93.
[29] A. Gupta,et al. The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.
[30] Hong Jiang,et al. Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[31] Christopher Batten,et al. The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..
[32] Mikko H. Lipasti,et al. Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[33] Guy E. Blelloch,et al. Scans as Primitive Parallel Operations , 1989, ICPP.
[34] R. Kumar,et al. An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.
[35] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.
[36] Yen-Kuang Chen,et al. The ALPBench benchmark suite for complex multimedia applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..
[37] Yale N. Patt,et al. The V-Way cache: demand-based associativity via global replacement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[38] Kunle Olukotun,et al. Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.
[39] Pradeep Dubey,et al. Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.
[40] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[41] Balaram Sinharoy,et al. POWER4 system microarchitecture , 2002, IBM J. Res. Dev..
[42] Michael Gschwind. Chip multiprocessing and the cell broadband engine , 2006, CF '06.
[43] Eftychios Sifakis,et al. Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors , 2007, ISCA '07.
[44] James E. Smith,et al. Complexity-Effective Superscalar Processors , 1997, ISCA.
[45] J. Larus,et al. Tempest and Typhoon: user-level shared memory , 1994, Proceedings of 21 International Symposium on Computer Architecture.
[46] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[47] Timothy G. Mattson,et al. Patterns for parallel programming , 2004 .
[48] Anant Agarwal,et al. LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.
[49] A.R. Newton,et al. An empirical evaluation of two memory-efficient directory methods , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[50] Mark Horowitz,et al. Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis , 2010, ISCA.
[51] Mark Horowitz,et al. An evaluation of directory schemes for cache coherence , 1998, ISCA '98.
[52] William J. Dally,et al. Sequoia: Programming the Memory Hierarchy , 2006, International Conference on Software Composition.
[53] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[54] Ralph Grishman,et al. The NYU ultracomputer—designing a MIMD, shared-memory parallel machine , 1998, ISCA '98.
[55] Mark D. Hill,et al. Virtual hierarchies to support server consolidation , 2007, ISCA '07.
[56] Kunle Olukotun,et al. The case for a single-chip multiprocessor , 1996, ASPLOS VII.
[57] Jesse M. Draper,et al. Distributed data access in AC , 1995, PPOPP '95.
[58] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[59] John E. Stone,et al. An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS XV.
[60] Hanspeter Mössenböck,et al. Design of the Java HotSpot™ client compiler for Java 6 , 2008, TACO.
[61] W. Daniel Hillis,et al. The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput..
[62] Li Fan,et al. Summary cache: a scalable wide-area web cache sharing protocol , 2000, TNET.
[63] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[64] S. Sathiya Keerthi,et al. A fast procedure for computing the distance between complex objects in three space , 1987, Proceedings. 1987 IEEE International Conference on Robotics and Automation.
[65] James R. Larus,et al. Cooperative shared memory: software and hardware for scalable multiprocessors , 1993, TOCS.
[66] James Reinders,et al. Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .
[67] Anoop Gupta,et al. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.
[68] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[69] D. Lenoski,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[70] Wolf-Dietrich Weber,et al. Power provisioning for a warehouse-sized computer , 2007, ISCA '07.
[71] Stefan Rusu,et al. A 45nm 8-core enterprise Xeon® processor , 2009 .
[72] Erik Hagersten,et al. WildFire: a scalable path for SMPs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[73] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.
[74] Laxmikant V. Kalé,et al. MSA: Multiphase Specifically Shared Arrays , 2004, LCPC.
[75] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[76] Norman P. Jouppi,et al. Heterogeneous chip multiprocessors , 2005, Computer.
[77] Samuel Naffziger,et al. Multi-Threaded Itanium®-Family Processor , 2005 .
[78] Coniferous softwood. GENERAL TERMS , 2003 .
[79] Andreas Moshovos. RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[80] Sanjay J. Patel,et al. Implementing a GPU programming model on a Non-GPU accelerator architecture , 2010, ISCA'10.
[81] Krste Asanovic,et al. Mondrian memory protection , 2002, ASPLOS X.
[82] M. Hestenes,et al. Methods of conjugate gradients for solving linear systems , 1952 .
[83] Katherine Yelick,et al. Introduction to UPC and Language Specification , 2000 .
[84] Andrea C. Arpaci-Dusseau,et al. Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.
[85] Sanjay J. Patel,et al. A Task-Centric Memory Model for Scalable Accelerator Architectures , 2010, IEEE Micro.
[86] Corporate. IEEE Standard for Scalable Coherent Interface (SCI): IEEE Std. 1596-1992 , 1993 .
[87] Sarita V. Adve,et al. DeNovo: Rethinking Hardware for Disciplined Parallelism , 2010 .
[88] J. Tukey,et al. An algorithm for the machine calculation of complex Fourier series , 1965 .
[89] Matteo Frigo,et al. The implementation of the Cilk-5 multithreaded language , 1998, PLDI.
[90] Luiz André Barroso,et al. Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[91] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.
[92] S. J. Vaughan-Nichols. Vendors Draw up a New Graphics-Hardware Approach , 2009 .
[93] Sanjay J. Patel,et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.
[94] M. Golden,et al. A 2.6GHz Dual-Core 64bx86 Microprocessor with DDR2 Memory Support , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.
[95] William E. Lorensen,et al. Marching cubes: A high resolution 3D surface construction algorithm , 1987, SIGGRAPH.
[96] Anoop Gupta,et al. SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.
[97] James Gray. The roadrunner supercomputer: a petaflop's no problem , 2008 .
[98] Pradeep Dubey,et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.
[99] Steven S. Lumetta,et al. HybridOS: runtime support for reconfigurable accelerators , 2008, FPGA '08.
[100] Kourosh Gharachorloo,et al. Shasta: a low overhead, software-only approach for supporting fine-grain shared memory , 1996, ASPLOS VII.
[101] P. K. Dubey,et al. Recognition, Mining and Synthesis Moves Computers to the Era of Tera , 2005 .
[102] Jonathan Douglas. Intel 8xx series and Paxville Xeon-MP microprocessors , 2005, 2005 IEEE Hot Chips XVII Symposium (HCS).
[103] Roy Friedman,et al. Implementing hybrid consistency with high-level synchronization operations , 1993, PODC '93.
[104] Ha Pham,et al. A 40nm 16-core 128-thread CMT SPARC SoC processor , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).
[105] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..
[106] David A. Wood,et al. LogTM-SE: Decoupling Hardware Transactional Memory from Caches , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[107] Alan L. Cox,et al. TreadMarks: shared memory computing on networks of workstations , 1996 .
[108] Brian N. Bershad,et al. The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.
[109] Mary K. Vernon,et al. Comparison of hardware and software cache coherence schemes , 1991, ISCA '91.
[110] Steven L. Scott,et al. Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.
[111] Anoop Gupta,et al. Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.
[112] Avi Mendelson,et al. Programming model for a heterogeneous x86 platform , 2009, PLDI '09.
[113] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[114] Anant Agarwal,et al. Directory-based cache coherence in large-scale multiprocessors , 1990, Computer.
[115] Jaejin Lee,et al. Using prime numbers for cache indexing to eliminate conflict misses , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).
[116] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[117] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.
[118] James R. Goodman,et al. Cache Consistency and Sequential Consistency , 1991 .
[119] Babak Falsafi,et al. Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.
[120] Chuanjun Zhang. Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[121] Liviu Iftode,et al. Scope Consistency: A Bridge between Release Consistency and Entry Consistency , 1996, SPAA '96.
[122] Willy Zwaenepoel,et al. Munin: distributed shared memory based on type-specific memory coherence , 1990, PPOPP '90.
[123] Jimmy Su,et al. Making Sequential Consistency Practical in Titanium , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[124] Mark D. Hill,et al. Amdahl's Law in the Multicore Era , 2008 .