Memory-mapping support for reducer hyperobjects

Reducer hyperobjects (reducers) provide a linguistic abstraction for dynamic multithreading that allows different branches of a parallel program to maintain coordinated local views of the same nonlocal variable. In this paper, we investigate how thread-local memory mapping (TLMM) can be used to improve the performance of reducers. Existing concurrency platforms that support reducer hyperobjects, such as Intel Cilk Plus and Cilk++, take a hypermap approach, in which a hash table maps reducer objects to their local views. The hash-table lookup is costly: roughly a 12x overhead compared to a normal L1-cache memory access on an AMD Opteron 8354. We replaced the Intel Cilk Plus runtime system with our own Cilk-M runtime system, which uses TLMM to implement a reducer mechanism that performs a reducer lookup using only two memory accesses and a predictable branch, roughly a 3x overhead compared to an ordinary L1-cache memory access. An empirical evaluation shows that the Cilk-M memory-mapping approach is close to 4x faster than the Cilk Plus hypermap approach. Furthermore, the memory-mapping approach admits better locality than the hypermap approach during parallel execution, which allows an application using reducers to scale better.
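The difference between the two lookup paths can be illustrated with a minimal C++ sketch. It is not the Cilk Plus or Cilk-M implementation: `SumReducer`, `kMaxReducers`, `view_table`, `lookup_hypermap`, `lookup_mapped`, and `sum_with` are hypothetical names, and a `thread_local` array stands in for a TLMM region (a range of virtual addresses that each worker maps to its own physical pages). The sketch assumes a fixed number of reducer slots and never frees views.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical reducer handle for an integer sum.
// `slot` is assumed to be assigned once, when the reducer is created.
struct SumReducer {
    std::size_t slot;  // index of this reducer's entry in each worker's view table
};

constexpr std::size_t kMaxReducers = 64;  // assumed fixed capacity for the sketch

// Per-worker state. thread_local is only a stand-in for a TLMM region:
// every worker sees its own copy of these objects.
thread_local std::unordered_map<const SumReducer*, long*> hypermap;
thread_local long* view_table[kMaxReducers] = {};

// Creates an identity view (0 for a sum). Views are never freed in this sketch.
long* make_view() { return new long(0); }

// Hypermap-style lookup: hash the reducer's address and probe a hash table.
long& lookup_hypermap(const SumReducer& r) {
    auto it = hypermap.find(&r);
    if (it == hypermap.end())
        it = hypermap.emplace(&r, make_view()).first;
    return *it->second;
}

// Memory-mapping-style lookup: two loads and one branch.
long& lookup_mapped(const SumReducer& r) {
    long* view = view_table[r.slot];  // load 1: r.slot; load 2: the table entry
    if (view == nullptr) {            // taken only on the worker's first access
        view = make_view();
        view_table[r.slot] = view;
    }
    return *view;
}

// Accumulates 1 + 2 + ... + n into the calling worker's local view
// using the given lookup function, then returns the view's value.
long sum_with(long& (*lookup)(const SumReducer&), const SumReducer& r, int n) {
    for (int i = 1; i <= n; ++i)
        lookup(r) += i;
    return lookup(r);
}
```

Calling `sum_with(lookup_hypermap, r, n)` and `sum_with(lookup_mapped, r, n)` on the same reducer yields the same sum, because each path keeps its own local view; the two differ only in how much work each lookup performs. In a real runtime the views of different workers would additionally be combined by a reduce operation, which this sketch omits.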
