Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine

As transistor technology continues to scale, the architecture community has seen exponential growth in design complexity and sharply rising implementation and verification costs. Moore's law has also driven a steady increase in the number of cores on a single chip. These large-core-count chips often provide a shared-memory abstraction via directories and coherence protocols, which are notoriously error-prone and difficult to verify because of subtle data races and state-space explosion. A very simple hardware shared-memory implementation is possible by disallowing ad-hoc data replication and relying on remote accesses for remotely cached data, which requires no directories or coherence protocols. Such remote-access-based directoryless architectures, however, cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM2), establishes a new design point. On the one hand, EM2 supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM2 chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM2 is a fairly large design (110 cores using a total of 357 million transistors), the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.
