Efficient Fine-Grain Synchronization on a Multi-Core Chip Architecture: A Fresh Look

Multi-core chip architectures are becoming mainstream, permitting increasing on-chip parallelism through hardware support for multithreading. Fine-grain synchronization is essential to the effective utilization of the capacity provided by future high-performance multi-core architectures. However, there are also new challenges in realizing such fine-grain synchronization in large-scale multi-core chip architectures, such as the IBM Cyclops-64 chip, which contains more than 100 processing cores and employs a memory organization with explicitly addressable memory segments instead of a data cache. This paper presents a fresh look at the challenges and proposes a scalable solution for fine-grain synchronization that efficiently enforces mutual exclusion and read-after-write data dependencies between concurrent threads. Using the Cyclops-64 chip architecture as a case study, we illustrate how to use a small Synchronization State Buffer (SSB) associated with each memory bank to accelerate fine-grain synchronization by recording and managing the states of frequently synchronized data units with modest hardware extensions. We demonstrate the effectiveness and efficiency of the proposed solution:

• For mutual exclusion: using distributed fine-grain locking at each of the memory units, we avoid the unnecessary serialization of operations on different elements of the same concurrent data structure, and we achieve this goal efficiently.

• For read-after-write data-dependency synchronization: our method encourages the exploration of the do-across style of loop-level parallelism, where loop-carried data dependencies can often be implemented directly by applying the fine-grain synchronization operations and removing unnecessary barriers.

The experimental results demonstrate a significant performance gain due to the use of the above fine-grain synchronization solutions.
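The read-after-write style of synchronization described above can be illustrated with a small software emulation. This is only a sketch under stated assumptions, not the paper's SSB mechanism or the Cyclops-64 instruction set: a per-element `threading.Event` stands in for the hardware-managed synchronization state of each data unit, and the function name `doacross_running_sum` and the cyclic assignment of iterations to threads are hypothetical choices made for illustration. Each iteration does some independent work, then waits only on the one element it depends on, so no barrier between iterations is needed.

```python
import threading


def doacross_running_sum(values, num_threads=4):
    """Running sum of squares; iteration i depends on the result of iteration i-1."""
    n = len(values)
    result = [0] * n
    # One "full" flag per element emulates per-data-unit synchronization state
    # (an assumption for illustration; the paper keeps this state in hardware).
    full = [threading.Event() for _ in range(n)]

    def worker(tid):
        # Iterations are assigned cyclically: thread tid runs i = tid, tid + T, ...
        for i in range(tid, n, num_threads):
            local = values[i] * values[i]  # independent part, can overlap across threads
            if i > 0:
                full[i - 1].wait()         # synchronized read: block until element i-1 is written
                local += result[i - 1]
            result[i] = local
            full[i].set()                  # synchronized write: mark element i as available

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Because every thread handles its iterations in increasing order and iteration 0 has no predecessor, each wait is eventually satisfied, and the only serialization is along the true loop-carried dependence.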
