Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory

Multicore designs have emerged as the mainstream design paradigm for the microprocessor industry. Unfortunately, providing multiple cores does not directly translate into performance for most applications. The industry has already fallen short of the decades-old performance trend of doubling performance every 18 months. An attractive approach for exploiting multiple cores is to rely on tools, both compilers and runtime optimizers, to automatically extract threads from sequential applications. However, despite decades of research on automatic parallelization, most techniques are only effective in the scientific and data parallel domains where array dominated codes can be precisely analyzed by the compiler. Thread-level speculation offers the opportunity to expand parallelization to general-purpose programs, but at the cost of expensive hardware support. In this paper, we focus on providing low-overhead software support for exploiting speculative parallelism. We propose STMlite, a light-weight software transactional memory model that is customized to facilitate profile-guided automatic loop parallelization. STMlite eliminates a considerable amount of checking and locking overhead in conventional software transactional memory models by decoupling the commit phase from main transaction execution. Further, strong atomicity requirements for generic transactional memories are unnecessary within a stylized automatic parallelization framework. STMlite enables sequential applications to extract meaningful performance gains on commodity multicore hardware.

[1]  Maged M. Michael,et al.  RingSTM: scalable transactions with a single atomic instruction , 2008, SPAA '08.

[2]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[3]  Michael F. Spear,et al.  Conflict Detection and Validation Strategies for Software Transactional Memory , 2006, DISC.

[4]  Brian T. Lewis,et al.  Compiler and runtime support for efficient software transactional memory , 2006, PLDI '06.

[5]  Ken Kennedy,et al.  The ParaScope parallel programming environment , 1993, Proc. IEEE.

[6]  Gurindar S. Sohi,et al.  Master/Slave Speculative Parallelization , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[7]  Hong-Seok Kim,et al.  Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis , 2004, SAS.

[8]  Mark Plesko,et al.  Optimizing memory transactions , 2006, PLDI '06.

[9]  Michael L. Scott,et al.  Flexible Decoupled Transactional Memory Support , 2008, 2008 International Symposium on Computer Architecture.

[10]  Matthew L. Seidl,et al.  Segregating heap objects by reference behavior and lifetime , 1998, ASPLOS VIII.

[11]  Maurice Herlihy,et al.  The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, Lock-Free Data Structures , 2002, DISC.

[12]  Matthew I. Frank,et al.  SUDS: automatic parallelization for raw processors , 2003 .

[13]  David Kirk,et al.  NVIDIA cuda software and gpu parallel computing architecture , 2007, ISMM '07.

[14]  Josep Torrellas,et al.  Bulk Disambiguation of Speculative Threads in Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[15]  Easwaran Raman,et al.  Speculative Decoupled Software Pipelining , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[16]  Kunle Olukotun,et al.  An effective hybrid transactional memory system with strong isolation guarantees , 2007, ISCA '07.

[17]  Kunle Olukotun,et al.  STAMP: Stanford Transactional Applications for Multi-Processing , 2008, 2008 IEEE International Symposium on Workload Characterization.

[18]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[19]  Kunle Olukotun,et al.  The Atomos transactional programming language , 2006, PLDI '06.

[20]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[22]  Antonio González,et al.  Thread-spawning schemes for speculative multithreading , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[23]  Keir Fraser,et al.  Language support for lightweight transactions , 2003, SIGP.

[24]  Monica S. Lam,et al.  Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..

[25]  Rudolf Eigenmann,et al.  Min-cut program decomposition for thread-level speculation , 2004, PLDI '04.

[26]  Nir Shavit,et al.  Transactional Locking II , 2006, DISC.

[27]  Quinn Jacobson,et al.  Architectural Support for Software Transactional Memory , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[28]  Yun Zhang,et al.  Revisiting the Sequential Programming Model for Multi-Core , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[29]  H. Peter Hofstee,et al.  Power efficient processor architecture and the cell processor , 2005, 11th International Symposium on High-Performance Computer Architecture.

[30]  Nir Shavit,et al.  Software transactional memory , 1995, PODC '95.

[31]  Martín Abadi,et al.  Transactional memory with strong atomicity using off-the-shelf memory protection hardware , 2009, PPoPP '09.

[32]  Todd C. Mowry,et al.  The potential for using thread-level data speculation to facilitate automatic parallelization , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[33]  Wei Liu,et al.  POSH: a TLS compiler that exploits program structure , 2006, PPoPP '06.

[34]  Chen Yang,et al.  A cost-driven compilation framework for speculative parallelization of sequential programs , 2004, PLDI '04.

[35]  Scott A. Mahlke,et al.  Uncovering hidden loop level parallelism in sequential applications , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[36]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[37]  Dan Grossman,et al.  Enforcing isolation and ordering in STM , 2007, PLDI '07.

[38]  Tatiana Shpeisman,et al.  Dynamic optimization for efficient strong atomicity , 2008, OOPSLA.

[39]  Kunle Olukotun,et al.  Exploiting method-level parallelism in single-threaded Java programs , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[40]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[41]  Nir Shavit,et al.  Understanding Tradeoffs in Software Transactional Memory , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[42]  Kunle Olukotun,et al.  Data speculation support for a chip multiprocessor , 1998, ASPLOS VIII.

[43]  Virendra J. Marathe,et al.  Adaptive Software Transactional Memory , 2005, DISC.

[44]  David A. Wood,et al.  LogTM-SE: Decoupling Hardware Transactional Memory from Caches , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.