Paper Information - Scalable Speculative Parallelization on Commodity Clusters

Scalable Speculative Parallelization on Commodity Clusters

While clusters of commodity servers and switches are the most popular form of large-scale parallel computers, many programs are not easily parallelized for execution upon them. In particular, high inter-node communication cost and lack of globally shared memory appear to make clusters suitable only for server applications with abundant task-level parallelism and scientific applications with regular and independent units of work. Clever use of pipeline parallelism (DSWP), thread-level speculation (TLS), and speculative pipeline parallelism (Spec-DSWP) can mitigate the costs of inter-thread communication on shared memory multicore machines. This paper presents Distributed Software Multi-threaded Transactional memory (DSMTX), a runtime system which makes these techniques applicable to non-shared memory clusters, allowing them to efficiently address inter-node communication costs. Initial results suggest that DSMTX enables efficient cluster execution of a wider set of application types. For 11 sequential C programs parallelized for a 4-core 32-node (128 total core) cluster without shared memory, DSMTX achieves a geomean speedup of 49x. This compares favorably to the 15x speedup achieved by our implementation of TLS-only support for clusters.
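To put the reported figures in perspective, the short sketch below computes the parallel efficiency implied by the abstract's numbers (49x and 15x geomean speedup on 128 cores) and shows how a geometric mean is formed. The function names and the two-element sample list are illustrative assumptions; the abstract does not list per-program speedups, so none are reproduced here.

```python
import math

def geomean(values):
    """Geometric mean, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

# Cluster size stated in the abstract: 32 nodes with 4 cores each.
CORES = 32 * 4

# Geomean speedups reported in the abstract.
dsmtx_eff = parallel_efficiency(49, CORES)   # DSMTX
tls_eff = parallel_efficiency(15, CORES)     # TLS-only implementation

# Hypothetical two-program example, only to show how a geomean behaves:
# a 2x and an 8x speedup average to 4x, not the arithmetic mean of 5x.
sample = geomean([2.0, 8.0])

print(f"cores={CORES} DSMTX eff={dsmtx_eff:.2f} TLS-only eff={tls_eff:.2f}")
```

Under these numbers, DSMTX reaches roughly 38% of ideal linear scaling on 128 cores, against roughly 12% for the TLS-only implementation.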
