Architectural support for thread communications in multi-core processors

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction-level parallelism (ILP), to improving multithreaded application performance by supporting thread-level parallelism (TLP). As a result, multi-core processors, which incorporate two or more cores on a single die, have become ubiquitous. To execute concurrently on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or by compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Although some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Instead, inter-core communications depend on cache coherence, which results in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improving communications support for multithreaded applications. Prepushing is a software-controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses at the destination and reducing coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches, so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement when these architecture optimizations are added to multi-core processors.
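The difference between demand-based transfers, prepushing, and SCE can be illustrated with a deliberately simplified toy model. The sketch below is not the paper's simulator or mechanism: the function name `service_points`, the three mode strings, and the set-based "caches" are all assumptions made for illustration. It only records where a consumer thread would find each line produced by another core, ignoring capacity, timing, and the coherence protocol.

```python
from collections import Counter


def service_points(num_lines, mode):
    """Toy model: where does the consumer find each produced cache line?

    mode is one of "demand", "prepush", or "sce" (hypothetical labels).
    Caches are modelled as unbounded sets; no latencies are modelled.
    """
    producer_private = set()  # producer core's private cache
    consumer_private = set()  # consumer core's private cache
    shared = set()            # cache shared by both cores
    served = Counter()

    for line in range(num_lines):
        # The producer writes the line, so it lands in the producer's cache.
        producer_private.add(line)

        if mode == "prepush":
            # Prepushing: forward the line to the consumer before it is read.
            consumer_private.add(line)
        elif mode == "sce":
            # SCE: evict the line from the private cache into the shared cache.
            producer_private.discard(line)
            shared.add(line)

        # The consumer now reads the line.
        if line in consumer_private:
            served["private"] += 1
        elif line in shared:
            served["shared"] += 1
        else:
            # Demand-based transfer from the remote (producer's) cache.
            served["remote"] += 1
        consumer_private.add(line)

    return dict(served)


if __name__ == "__main__":
    for mode in ("demand", "prepush", "sce"):
        print(mode, service_points(8, mode))
```

In this model every read is serviced remotely under demand-based coherence, from the consumer's own cache under prepushing, and from the shared cache under SCE, which mirrors the qualitative claim in the abstract without saying anything about actual latencies.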
