Speculative parallelization of partial reduction variables

Reduction variables are an important class of cross-thread dependence that can be parallelized by exploiting the associativity and commutativity of their operation. In this paper, we define a class of shared variables called partial reduction variables (PRVs). These variables either cannot be proven to be reductions or violate the requirements of a reduction variable in some way. We describe an algorithm that allows the compiler to detect PRVs, and we discuss the requirements that must be met to parallelize them. Based on these requirements, we propose an implementation in a thread-level speculation (TLS) system that parallelizes PRVs through a combination of compile-time transformation and hardware support. The compiler transforms the variable under the assumption that the reduction-like behavior proven statically will hold at runtime. If a thread nevertheless reads or updates the shared variable through an alias or an unlikely control path, a lightweight hardware mechanism detects the access and synchronizes it to ensure correct execution. We implement our compiler analysis and transformation in GCC and analyze its potential on the SPEC CPU 2000 benchmarks. We find that supporting PRVs provides up to a 46% performance gain over a highly optimized TLS system, and a 10.7% performance improvement on average.
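To make the notion of a partial reduction variable concrete, the C fragment below is a minimal, hypothetical illustration (not drawn from the paper). The variable `sum` behaves as a reduction on the hot path, but a rarely taken branch reads its intermediate value and a store through `p` may alias it, so the compiler cannot prove it is a true reduction; under the scheme described in the abstract, the loop would be transformed speculatively and the rare accesses caught and synchronized at runtime.

```c
#include <stdio.h>

/* Hypothetical example of a partial reduction variable (PRV). */
long process(const long *a, long *p, int n, long threshold)
{
    long sum = 0;                       /* candidate partial reduction variable */

    for (int i = 0; i < n; i++) {
        sum += a[i];                    /* reduction-like update (hot path) */

        if (a[i] > threshold) {         /* unlikely control path: reads the
                                           live intermediate value of sum   */
            printf("spike at %d: running total %ld\n", i, sum);
        }

        *p = a[i];                      /* may alias &sum; in general this
                                           cannot be ruled out statically   */
    }
    return sum;
}
```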
