Scalable Speculative Parallelization on Commodity Clusters

While clusters of commodity servers and switches are the most popular form of large-scale parallel computers, many programs are not easily parallelized for execution on them. In particular, high inter-node communication costs and the lack of globally shared memory appear to make clusters suitable only for server applications with abundant task-level parallelism and scientific applications with regular and independent units of work. Clever use of pipeline parallelism (DSWP), thread-level speculation (TLS), and speculative pipeline parallelism (Spec-DSWP) can mitigate the costs of inter-thread communication on shared-memory multicore machines. This paper presents Distributed Software Multi-threaded Transactional memory (DSMTX), a runtime system that makes these techniques applicable to clusters without shared memory, so that they can efficiently address inter-node communication costs. Initial results suggest that DSMTX enables efficient cluster execution of a wider set of application types. For 11 sequential C programs parallelized for a 32-node cluster with 4 cores per node (128 cores in total) and no shared memory, DSMTX achieves a geometric mean speedup of 49x. This compares favorably to the 15x speedup achieved by our implementation of TLS-only support for clusters.
