Dissecting sequential programs for parallelization—An approach based on computational units

When trying to parallelize a sequential program, programmers routinely struggle with the first step: identifying which code sections can run in parallel. Most current parallelism-discovery techniques look for specific language constructs when identifying such sections. In contrast, we propose to concentrate on the computations a program performs. In our approach, a program is treated as a collection of computations communicating with one another through a set of variables. Each computation is represented as a computational unit (CU). A CU encapsulates the inputs and outputs of a computation, which proceeds in three phases: read, compute, and write. Based on the notion of a CU, in which the read phase always precedes the write phase, we present a unified framework for identifying both loop parallelism and task parallelism in sequential programs. We conducted experiments on 23 applications from four benchmark suites. Compared with the existing parallel versions of these benchmarks, our approach accurately identified the known parallelization opportunities. We also parallelized opportunities identified by our approach that were not implemented in the parallel versions of the benchmarks and report the resulting speedups.
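
To make the CU abstraction concrete, consider the minimal sketch below. It is our own illustration, assuming OpenMP (compile with -fopenmp); the function and variable names are hypothetical, not code from the paper. It shows how the read, compute, and write phases of a CU appear in source code, and how independent CUs translate into loop parallelism and task parallelism.

    /* Loop parallelism: each iteration forms one CU.
     * Read phase:    load a[i] and b[i]  (the CU's inputs)
     * Compute phase: add them
     * Write phase:   store c[i]          (the CU's output)
     * No CU reads a variable written by another CU, so all the
     * iteration CUs may execute in parallel. */
    void vector_add(long n, const float *a, const float *b, float *c)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; ++i) {
            float x = a[i];  /* read    */
            float y = b[i];  /* read    */
            float r = x + y; /* compute */
            c[i] = r;        /* write   */
        }
    }

    /* Task parallelism: two CUs that communicate through no shared
     * variables can run as concurrent tasks. */
    void scale_and_shift(long n, float *u, float *v)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task  /* CU 1: reads and writes only u */
            for (long i = 0; i < n; ++i) u[i] *= 2.0f;

            #pragma omp task  /* CU 2: reads and writes only v */
            for (long i = 0; i < n; ++i) v[i] += 1.0f;

            #pragma omp taskwait  /* both CUs complete here */
        }
    }

In both functions, the absence of a variable that one CU writes and another CU reads is what licenses the parallelization; detecting exactly this condition over CUs is the core of the approach.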
