Paper information - Hybrid coherence for scalable multicore architectures

Hybrid coherence for scalable multicore architectures

This dissertation describes a cache architecture and memory model for 1000-core microprocessors. Our approach exploits workload characteristics and programming model assumptions to build a hybrid memory model that incorporates features from both software-managed coherence schemes and hardware-managed cache coherence. The goal is to achieve the scalability found in compute accelerators, which support relaxed ordering of memory operations and programmer-managed coherence, while providing a programming interface that is akin to the strongly ordered cache coherent memory models found in general-purpose multicore processors today. The research presented in this dissertation supports the following thesis: To be scalable and programmable, future multicore systems require a cached, single-address-space memory hierarchy. A hybrid software and hardware approach to coherence management is required to support such a memory hierarchy in 1000-core processors and is achievable only by leveraging the characteristics of target applications and system software. We motivate a hybrid memory model and present our approach to addressing the challenges facing such a model. We discuss and evaluate a scalable 1024-core architecture, workloads that we see as targets for such an architecture, a memory model that relies on software management of coherence, and scalable hardware coherence schemes. Using these components, we develop the software and hardware support for a hybrid memory model. We demonstrate that our techniques can be used to reduce hardware design complexity, to increase software scalability, or to combine the two.

J. H. Kelm | John Henry Kelm

[1] Edward J. McCluskey,et al. PADded cache: a new fault-tolerance technique for cache memories , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[2] Samuel Williams,et al. Auto-tuning performance on multicore computers , 2008 .

[3] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[4] Jonathan Chang,et al. A 45 nm 8-Core Enterprise Xeon® Processor , 2010, IEEE J. Solid State Circuits.

[5] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[6] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[7] Norman P. Jouppi,et al. Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[8] Michael L. Scott,et al. Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[9] Christopher J. Hughes,et al. Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.

[10] David A. Padua,et al. Hierarchically tiled arrays for parallelism and locality , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[11] James R. Goodman,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[12] Maged M. Michael,et al. Design and performance of directory caches for scalable shared memory multiprocessors , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[13] Justin P. Haldar,et al. Accelerating advanced MRI reconstructions on GPUs , 2008, J. Parallel Distributed Comput..

[14] Randi J. Rost. OpenGL shading language , 2004 .

[15] Paul Feautrier,et al. A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[16] Milo M. K. Martin,et al. Token tenure: PATCHing token counting using directory-based cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[17] Steven K. Reinhardt,et al. A fully associative software-managed cache design , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[18] David K. McAllister,et al. Fast matrix multiplies using graphics hardware , 2001, SC.

[19] Christoforos E. Kozyrakis,et al. Comparing memory systems for chip multiprocessors , 2007, ISCA '07.

[20] Krisztián Flautner,et al. Evolution of thread-level parallelism in desktop applications , 2010, ISCA.

[21] Sanjay J. Patel,et al. Tradeoffs in designing accelerator architectures for visual computing , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[22] Michel Dubois,et al. Memory access buffering in multiprocessors , 1998, ISCA '98.

[23] J. A. Hartigan,et al. A k-means clustering algorithm , 1979 .

[24] Gregory Francis Pfister,et al. In search of clusters (2nd ed.) , 1998 .

[25] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[26] Daniel Gajski,et al. CEDAR: a large scale multiprocessor , 1983, CARN.

[27] Eric A. Brewer,et al. How to get good performance from the CM-5 data network , 1994, Proceedings of 8th International Parallel Processing Symposium.

[28] Rida A. Bazzi,et al. The power of processor consistency , 1993, SPAA '93.

[29] A. Gupta,et al. The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[30] Hong Jiang,et al. Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[31] Christopher Batten,et al. The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[32] Mikko H. Lipasti,et al. Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[33] Guy E. Blelloch,et al. Scans as Primitive Parallel Operations , 1989, ICPP.

[34] R. Kumar,et al. An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[35] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[36] Yen-Kuang Chen,et al. The ALPBench benchmark suite for complex multimedia applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[37] Yale N. Patt,et al. The V-Way cache: demand-based associativity via global replacement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[38] Kunle Olukotun,et al. Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[39] Pradeep Dubey,et al. Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.

[40] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[41] Balaram Sinharoy,et al. POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[42] Michael Gschwind. Chip multiprocessing and the cell broadband engine , 2006, CF '06.

[43] Eftychios Sifakis,et al. Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors , 2007, ISCA '07.

[44] James E. Smith,et al. Complexity-Effective Superscalar Processors , 1997, ISCA.

[45] J. Larus,et al. Tempest and Typhoon: user-level shared memory , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[46] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[47] Timothy G. Mattson,et al. Patterns for parallel programming , 2004 .

[48] Anant Agarwal,et al. LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[49] A.R. Newton,et al. An empirical evaluation of two memory-efficient directory methods , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[50] Mark Horowitz,et al. Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis , 2010, ISCA.

[51] Mark Horowitz,et al. An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[52] William J. Dally,et al. Sequoia: Programming the Memory Hierarchy , 2006, International Conference on Software Composition.

[53] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.

[54] Ralph Grishman,et al. The NYU ultracomputer—designing a MIMD, shared-memory parallel machine , 2018, ISCA '98.

[55] Mark D. Hill,et al. Virtual hierarchies to support server consolidation , 2007, ISCA '07.

[56] Kunle Olukotun,et al. The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[57] Jesse M. Draper,et al. Distributed data access in AC , 1995, PPOPP '95.

[58] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .

[59] John E. Stone,et al. An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS XV.

[60] Hanspeter Mössenböck,et al. Design of the Java HotSpot™ client compiler for Java 6 , 2008, TACO.

[61] W. Daniel Hillis,et al. The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput..

[62] Li Fan,et al. Summary cache: a scalable wide-area web cache sharing protocol , 2000, TNET.

[63] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[64] S. Sathiya Keerthi,et al. A fast procedure for computing the distance between complex objects in three space , 1987, Proceedings. 1987 IEEE International Conference on Robotics and Automation.

[65] James R. Larus,et al. Cooperative shared memory: software and hardware for scalable multiprocessors , 1993, TOCS.

[66] James Reinders,et al. Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .

[67] Anoop Gupta,et al. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[68] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[69] D. Lenoski,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[70] Wolf-Dietrich Weber,et al. Power provisioning for a warehouse-sized computer , 2007, ISCA '07.

[71] Stefan Rusu,et al. A 45nm 8-core enterprise Xeon® processor , 2009 .

[72] Erik Hagersten,et al. WildFire: a scalable path for SMPs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[73] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.

[74] Laxmikant V. Kalé,et al. MSA: Multiphase Specifically Shared Arrays , 2004, LCPC.

[75] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[76] Norman P. Jouppi,et al. Heterogeneous chip multiprocessors , 2005, Computer.

[77] Samuel Naffziger,et al. Multi-Threaded Itanium®-Family Processor , 2005 .

[78] Coniferous softwood. GENERAL TERMS , 2003 .

[79] Andreas Moshovos. RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[80] Sanjay J. Patel,et al. Implementing a GPU programming model on a Non-GPU accelerator architecture , 2010, ISCA'10.

[81] Krste Asanovic,et al. Mondrian memory protection , 2002, ASPLOS X.

[82] M. Hestenes,et al. Methods of conjugate gradients for solving linear systems , 1952 .

[83] Katherine Yelick,et al. Introduction to UPC and Language Specification , 2000 .

[84] Andrea C. Arpaci-Dusseau,et al. Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[85] Sanjay J. Patel,et al. A Task-Centric Memory Model for Scalable Accelerator Architectures , 2010, IEEE Micro.

[86] Corporate. IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992 , 1993 .

[87] Sarita V. Adve,et al. DeNovo: Rethinking Hardware for Disciplined Parallelism , 2010 .

[88] J. Tukey,et al. An algorithm for the machine calculation of complex Fourier series , 1965 .

[89] Matteo Frigo,et al. The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[90] Luiz André Barroso,et al. Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[91] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[92] S. J. Vaughan-Nichols. Vendors Draw up a New Graphics-Hardware Approach , 2009 .

[93] Sanjay J. Patel,et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.

[94] M. Golden,et al. A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[95] William E. Lorensen,et al. Marching cubes: A high resolution 3D surface construction algorithm , 1987, SIGGRAPH.

[96] Anoop Gupta,et al. SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[97] James Gray. The roadrunner supercomputer: a petaflop's no problem , 2008 .

[98] Pradeep Dubey,et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[99] Steven S. Lumetta,et al. HybridOS: runtime support for reconfigurable accelerators , 2008, FPGA '08.

[100] Kourosh Gharachorloo,et al. Shasta: a low overhead, software-only approach for supporting fine-grain shared memory , 1996, ASPLOS VII.

[101] P. K. Dubey,et al. Recognition, Mining and Synthesis Moves Computers to the Era of Tera , 2005 .

[102] Jonathan Douglas. Intel 8xx series and Paxville Xeon-MP microprocessors , 2005, 2005 IEEE Hot Chips XVII Symposium (HCS).

[103] Roy Friedman,et al. Implementing hybrid consistency with high-level synchronization operations , 1993, PODC '93.

[104] Ha Pham,et al. A 40nm 16-core 128-thread CMT SPARC SoC processor , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[105] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[106] David A. Wood,et al. LogTM-SE: Decoupling Hardware Transactional Memory from Caches , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[107] Alan L. Cox,et al. TreadMarks: shared memory computing on networks of workstations , 1996 .

[108] Brian N. Bershad,et al. The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[109] Mary K. Vernon,et al. Comparison of hardware and software cache coherence schemes , 1991, ISCA '91.

[110] Steven L. Scott,et al. Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.

[111] Anoop Gupta,et al. Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.

[112] Avi Mendelson,et al. Programming model for a heterogeneous x86 platform , 2009, PLDI '09.

[113] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[114] Anant Agarwal,et al. Directory-based cache coherence in large-scale multiprocessors , 1990, Computer.

[115] Jaejin Lee,et al. Using prime numbers for cache indexing to eliminate conflict misses , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[116] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[117] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[118] James R. Goodman,et al. Cache Consistency and Sequential Consistency , 1991 .

[119] Babak Falsafi,et al. Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[120] Chuanjun Zhang. Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[121] Liviu Iftode,et al. Scope Consistency: A Bridge between Release Consistency and Entry Consistency , 1996, SPAA '96.

[122] Willy Zwaenepoel,et al. Munin: distributed shared memory based on type-specific memory coherence , 1990, PPOPP '90.

[123] Jimmy Su,et al. Making Sequential Consistency Practical in Titanium , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[124] Mark D. Hill,et al. Amdahl's Law in the Multicore Era , 2008 .