Flexible Support for Fast Parallel Commutative Updates

Privatizing data is a useful strategy for increasing parallelism in a shared memory multithreaded program. Independent cores can compute independently on duplicates of shared data, combining their results at the end of their computations. Conventional approaches to privatization, however, rely on explicit static or dynamic memory allocation for duplicated state, increasing memory footprint and contention for cache resources, especially in shared caches. In this work, we describe CCache, a system for on-demand privatization of data manipulated by commutative operations. CCache garners the benefits of privatization, without the increase in memory footprint or cache occupancy. Each core in CCache dynamically privatizes commutatively manipulated data, operating on a copy. Periodically or at the end of its computation, the core merges its value with the value resident in memory, and when all cores have merged, the in-memory copy contains the up-to-date value. We describe a low-complexity architectural implementation of CCache that extends a conventional multicore to support on-demand privatization without using additional memory for private copies. We evaluate CCache on several high-value applications, including random access key-value store, clustering, breadth first search and graph ranking, showing speedups upto 3.2X.

[1]  Thomas F. Wenisch,et al.  Mechanisms for store-wait-free multiprocessors , 2007, ISCA '07.

[2]  Sen Hu,et al.  Efficient system-enforced deterministic parallelism , 2010, OSDI.

[3]  David A. Padua,et al.  Advanced compiler optimizations for supercomputers , 1986, CACM.

[4]  Kunle Olukotun,et al.  Data speculation support for a chip multiprocessor , 1998, ASPLOS VIII.

[5]  Sebastian Burckhardt,et al.  Concurrent programming with revisions and isolation types , 2010, OOPSLA.

[6]  Brian W. Barrett,et al.  Introducing the Graph 500 , 2010 .

[7]  Antonia Zhai,et al.  The STAMPede approach to thread-level speculation , 2005, TOCS.

[8]  Sebastian Burckhardt,et al.  Two for the price of one: a model for parallel and incremental computation , 2011, OOPSLA '11.

[9]  Keir Fraser,et al.  Concurrent programming without locks , 2007, TOCS.

[10]  Brandon Lucia,et al.  Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races , 2010, ISCA.

[11]  Michael F. Spear,et al.  An integrated hardware-software approach to flexible transactional memory , 2007, ISCA '07.

[12]  Dan Grossman,et al.  CoreDet: a compiler and runtime system for deterministic multithreaded execution , 2010, ASPLOS XV.

[13]  Swarnendu Biswas,et al.  Valor: efficient, software-only region conflict exceptions , 2015, OOPSLA.

[14]  Antonia Zhai,et al.  A scalable approach to thread-level speculation , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15]  Bradley C. Kuszmaul,et al.  Unbounded Transactional Memory , 2005, HPCA.

[16]  Martin C. Rinard,et al.  Commutativity analysis: a new analysis technique for parallelizing compilers , 1997, TOPL.

[17]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[18]  Kunle Olukotun,et al.  Programming with transactional coherence and consistency (TCC) , 2004, ASPLOS XI.

[19]  Daniel Sánchez,et al.  Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Yun Zhang,et al.  Commutative set: a language extension for implicit parallel programming , 2011, PLDI '11.

[21]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[22]  Matteo Frigo,et al.  Reducers and other Cilk++ hyperobjects , 2009, SPAA '09.

[23]  E. Berger,et al.  Grace: Safe and Efficient Concurrent Programming , 2008 .

[24]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[25]  Milo M. K. Martin,et al.  RETCON: transactional repair without replay , 2010, ISCA '10.

[26]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[27]  Zhiyuan Li,et al.  General data structure expansion for multi-threading , 2013, PLDI.

[28]  David A. Patterson,et al.  The GAP Benchmark Suite , 2015, ArXiv.

[29]  Henry Hoffmann,et al.  Managing performance vs. accuracy trade-offs with loop perforation , 2011, ESEC/FSE '11.

[30]  Satish Narayanasamy,et al.  DRFX: a simple and efficient memory model for concurrent programming languages , 2010, PLDI '10.

[31]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[32]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[33]  Thomas F. Wenisch,et al.  InvisiFence: performance-transparent memory ordering in conventional multiprocessors , 2009, ISCA '09.

[34]  Josep Torrellas,et al.  BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.

[35]  Christopher J. Hughes,et al.  Performance evaluation of Intel® Transactional Synchronization Extensions for high-performance computing , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[36]  David A. Padua,et al.  Automatic Array Privatization , 1993, Compiler Optimizations for Scalable Parallel Systems Languages.

[37]  Todd Mytkowicz,et al.  Parallelizing user-defined aggregations using symbolic execution , 2015, SOSP.