Hybrid coherence for scalable multicore architectures

This dissertation describes a cache architecture and memory model for 1000core microprocessors. Our approach exploits workload characteristics and programming model assumptions to build a hybrid memory model that incorporates features from both software-managed coherence schemes and hardware-managed cache coherence. The goal is to achieve the scalability found in compute accelerators, which support relaxed ordering of memory operations and programmermanaged coherence, while providing a programming interface that is akin to the strongly ordered cache coherent memory models found in general-purpose multicore processors today. The research presented in this dissertation supports the following thesis: To be scalable and programmable, future multicore systems require a cached, singleaddress space memory hierarchy. A hybrid software and hardware approach to coherence management is required to support such a memory hierarchy in 1000core processors and is achievable only by leveraging the characteristics of target applications and system software. We motivate a hybrid memory model and present our approach to addressing the challenges facing such a model. We discuss and evaluate a scalable 1024core architecture, workloads that we see as targets for such an architecture, a memory model that relies on software management of coherence, and scalable hardware coherence schemes. Using these components, we develop the software and hardware support for a hybrid memory model. We demonstrate that our techniques can be used to reduce hardware design complexity, to increase software scalability, or to combine the two.

[1]  Edward J. McCluskey,et al.  PADded cache: a new fault-tolerance technique for cache memories , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[2]  Samuel Williams,et al.  Auto-tuning performance on multicore computers , 2008 .

[3]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[4]  Jonathan Chang,et al.  A 45 nm 8-Core Enterprise Xeon¯ Processor , 2010, IEEE J. Solid State Circuits.

[5]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[6]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[7]  Norman P. Jouppi,et al.  Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[8]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[9]  Christopher J. Hughes,et al.  Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.

[10]  David A. Padua,et al.  Hierarchically tiled arrays for parallelism and locality , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[11]  James R. Goodman,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[12]  Maged M. Michael,et al.  Design and performance of directory caches for scalable shared memory multiprocessors , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[13]  Justin P. Haldar,et al.  Accelerating advanced MRI reconstructions on GPUs , 2008, J. Parallel Distributed Comput..

[14]  Randi J. Rost OpenGL shading language , 2004 .

[15]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[16]  Milo M. K. Martin,et al.  Token tenure: PATCHing token counting using directory-based cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[17]  Steven K. Reinhardt,et al.  A fully associative software-managed cache design , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[18]  David K. McAllister,et al.  Fast matrix multiplies using graphics hardware , 2001, SC.

[19]  Christoforos E. Kozyrakis,et al.  Comparing memory systems for chip multiprocessors , 2007, ISCA '07.

[20]  Krisztián Flautner,et al.  Evolution of thread-level parallelism in desktop applications , 2010, ISCA.

[21]  Sanjay J. Patel,et al.  Tradeoffs in designing accelerator architectures for visual computing , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[22]  Michel Dubois,et al.  Memory access buffering in multiprocessors , 1998, ISCA '98.

[23]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[24]  Gregory Francis Pfister,et al.  In search of clusters (2nd ed.) , 1998 .

[25]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[26]  Daniel Gajski,et al.  CEDAR: a large scale multiprocessor , 1983, CARN.

[27]  Eric A. Brewer,et al.  How to get good performance from the CM-5 data network , 1994, Proceedings of 8th International Parallel Processing Symposium.

[28]  Rida A. Bazzi,et al.  The power of processor consistency , 1993, SPAA '93.

[29]  A. Gupta,et al.  The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[30]  Hong Jiang,et al.  Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[31]  Christopher Batten,et al.  The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[32]  Mikko H. Lipasti,et al.  Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[33]  Guy E. Blelloch,et al.  Scans as Primitive Parallel Operations , 1989, ICPP.

[34]  R. Kumar,et al.  An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[35]  William J. Dally,et al.  Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[36]  Yen-Kuang Chen,et al.  The ALPBench benchmark suite for complex multimedia applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[37]  Yale N. Patt,et al.  The V-Way cache: demand-based associativity via global replacement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[38]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[39]  Pradeep Dubey,et al.  Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.

[40]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[41]  Balaram Sinharoy,et al.  POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[42]  Michael Gschwind Chip multiprocessing and the cell broadband engine , 2006, CF '06.

[43]  Eftychios Sifakis,et al.  Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors , 2007, ISCA '07.

[44]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, ISCA.

[45]  J. Larus,et al.  Tempest and Typhoon: user-level shared memory , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[46]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[47]  Timothy G. Mattson,et al.  Patterns for parallel programming , 2004 .

[48]  Anant Agarwal,et al.  LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[49]  A.R. Newton,et al.  An empirical evaluation of two memory-efficient directory methods , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[50]  Mark Horowitz,et al.  Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis , 2010, ISCA.

[51]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[52]  William J. Dally,et al.  Sequoia: Programming the Memory Hierarchy , 2006, International Conference on Software Composition.

[53]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[54]  Ralph Grishman,et al.  The NYU ultracomputer—designing a MIMD, shared-memory parallel machine , 2018, ISCA '98.

[55]  Mark D. Hill,et al.  Virtual hierarchies to support server consolidation , 2007, ISCA '07.

[56]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[57]  Jesse M. Draper,et al.  Distributed data access in AC , 1995, PPOPP '95.

[58]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[59]  John E. Stone,et al.  An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS XV.

[60]  Hanspeter Mössenböck,et al.  Design of the Java HotSpot#8482; client compiler for Java 6 , 2008, TACO.

[61]  W. Daniel Hillis,et al.  The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput..

[62]  Li Fan,et al.  Summary cache: a scalable wide-area web cache sharing protocol , 2000, TNET.

[63]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[64]  S. Sathiya Keerthi,et al.  A fast procedure for computing the distance between complex objects in three space , 1987, Proceedings. 1987 IEEE International Conference on Robotics and Automation.

[65]  James R. Larus,et al.  Cooperative shared memory: software and hardware for scalable multiprocessors , 1993, TOCS.

[66]  James Reinders,et al.  Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .

[67]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[68]  G. Amdhal,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[69]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[70]  Wolf-Dietrich Weber,et al.  Power provisioning for a warehouse-sized computer , 2007, ISCA '07.

[71]  Stefan Rusu,et al.  A 45nm 8-core enterprise Xeon ® processor , 2009 .

[72]  Erik Hagersten,et al.  WildFire: a scalable path for SMPs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[73]  John L. Gustafson,et al.  Reevaluating Amdahl's law , 1988, CACM.

[74]  Laxmikant V. Kalé,et al.  MSA: Multiphase Specifically Shared Arrays , 2004, LCPC.

[75]  Wen-mei W. Hwu,et al.  Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[76]  Norman P. Jouppi,et al.  Heterogeneous chip multiprocessors , 2005, Computer.

[77]  Samuel Naffziger,et al.  Multi-Threaded Itanium®-Family Processor , 2005 .

[78]  Coniferous softwood GENERAL TERMS , 2003 .

[79]  Andreas Moshovos RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[80]  Sanjay J. Patel,et al.  Implementing a GPU programming model on a Non-GPU accelerator architecture , 2010, ISCA'10.

[81]  Krste Asanovic,et al.  Mondrian memory protection , 2002, ASPLOS X.

[82]  M. Hestenes,et al.  Methods of conjugate gradients for solving linear systems , 1952 .

[83]  Katherine Yelick,et al.  Introduction to UPC and Language Specification , 2000 .

[84]  Andrea C. Arpaci-Dusseau,et al.  Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[85]  Sanjay J. Patel,et al.  A Task-Centric Memory Model for Scalable Accelerator Architectures , 2010, IEEE Micro.

[86]  Corporate IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992 , 1993 .

[87]  Sarita V. Adve,et al.  DeNovo: Rethinking Hardware for Disciplined Parallelism , 2010 .

[88]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[89]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[90]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[91]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[92]  S. J. Vaughn-Nichols Vendors Draw up a New Graphics-Hardware Approach , 2009 .

[93]  Sanjay J. Patel,et al.  Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.

[94]  M. Golden,et al.  A 2.6GHz Dual-Core 64bx86 Microprocessor with DDR2 Memory Support , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[95]  William E. Lorensen,et al.  Marching cubes: A high resolution 3D surface construction algorithm , 1987, SIGGRAPH.

[96]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[97]  James Gray The roadrunner supercomputer: a petaflop's no problem , 2008 .

[98]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[99]  Steven S. Lumetta,et al.  HybridOS: runtime support for reconfigurable accelerators , 2008, FPGA '08.

[100]  Kourosh Gharachorloo,et al.  Shasta: a low overhead, software-only approach for supporting fine-grain shared memory , 1996, ASPLOS VII.

[101]  P. K. Dubey,et al.  Recognition, Mining and Synthesis Moves Comp uters to the Era of Tera , 2005 .

[102]  Jonatthan Dougllas Intel 8×× seriies and paxville xeon-MP microprocessors , 2005, 2005 IEEE Hot Chips XVII Symposium (HCS).

[103]  Roy Friedman,et al.  Implementing hybrid consistency with high-level synchronization operations , 1993, PODC '93.

[104]  Ha Pham,et al.  A 40nm 16-core 128-thread CMT SPARC SoC processor , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[105]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[106]  David A. Wood,et al.  LogTM-SE: Decoupling Hardware Transactional Memory from Caches , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[107]  Alan L. Cox,et al.  TreadMarks: shared memory computing on networks of workstations , 1996 .

[108]  Brian N. Bershad,et al.  The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[109]  Mary K. Vernon,et al.  Comparison of hardware and software cache coherence schemes , 1991, ISCA '91.

[110]  Steven L. Scott,et al.  Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.

[111]  Anoop Gupta,et al.  Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.

[112]  Avi Mendelson,et al.  Programming model for a heterogeneous x86 platform , 2009, PLDI '09.

[113]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[114]  Anant Agarwal,et al.  Directory-based cache coherence in large-scale multiprocessors , 1990, Computer.

[115]  Jaejin Lee,et al.  Using prime numbers for cache indexing to eliminate conflict misses , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[116]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[117]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[118]  James R. Goodman,et al.  Cache Consistency and Sequential Consistency , 1991 .

[119]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[120]  Chuanjun Zhang Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[121]  Liviu Iftode,et al.  Scope Consistency: A Bridge between Release Consistency and Entry Consistency , 1996, SPAA '96.

[122]  Willy Zwaenepoel,et al.  Munin: distributed shared memory based on type-specific memory coherence , 1990, PPOPP '90.

[123]  Jimmy Su,et al.  Making Sequential Consistency Practical in Titanium , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[124]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008 .