Automatic Parallelization of Sequential Programs

Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then speculatively executing those future computations. Although that prior work demonstrated scaling, it did not demonstrate speedup, because it ran entirely in emulation. We took this as a challenge to construct a hardware prototype that embodies the ideas of ASC but works on a broader range of programs and runs natively on hardware. The resulting system is similar in spirit to the original work but differs in practically every respect. We present an implementation of the ASC architecture that runs natively on x86 hardware and achieves near-linear speedup up to 44 cores (the size of our test platform) for several classes of programs, such as computational kernels, map-style programs, and matrix operations. We observe that programs are either completely predictable, achieving near-perfect predictive accuracy, or entirely unpredictable and therefore not amenable to scaling via ASC-like techniques. We also find that, in most cases, speedup is limited only by implementation details: the overhead of our dependency-tracking infrastructure and the manipulation of large state spaces. We are able to automatically parallelize programs with linked data structures that are not amenable to other forms of automatic parallelization.
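To make the approach concrete, the sketch below illustrates in C++ the kind of predict-and-speculate loop the abstract describes: speculative workers run forward from predicted future states and cache (start state, end state) pairs, while the main ("trunk") execution fast-forwards whenever it reaches a cached start state. The toy state representation, the oracle-style predictor, and all names here are illustrative assumptions, not the paper's implementation, which tracks real x86 machine state and memory dependencies.

```cpp
// Hypothetical sketch of an ASC-style predict-and-speculate loop on a toy
// "sequential program" whose whole state is two integers.  Not the paper's
// implementation; all structure and names are assumptions for illustration.
#include <cstdint>
#include <future>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct State {
    std::uint64_t pc;   // stand-in for the instruction pointer
    std::uint64_t acc;  // stand-in for the registers/memory the program touches
    bool operator<(const State& o) const {
        return pc != o.pc ? pc < o.pc : acc < o.acc;
    }
};

// One step of the toy sequential computation (a deterministic kernel).
State step(State s) {
    s.acc = s.acc * 6364136223846793005ULL + 1442695040888963407ULL;
    s.pc += 1;
    return s;
}

// Run `n` steps forward from `s`; this is the work a speculative worker does.
State run(State s, std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) s = step(s);
    return s;
}

// Predictor: guess the state `ahead` steps in the future.  A real predictor
// would be learned from observed execution trajectories; here we cheat with
// an oracle so every speculation hits.  A wrong guess would simply leave an
// unreachable cache entry, wasting that worker's effort.
State predict(State s, std::uint64_t ahead) {
    return run(s, ahead);
}

int main() {
    const std::uint64_t chunk = 1000;      // steps per speculative task
    const std::uint64_t total = 8 * chunk; // total work to retire
    State trunk{0, 1};

    // Launch speculative workers from predicted future states and cache the
    // (start state -> end state) pairs they compute.
    std::map<State, State> cache;
    std::vector<std::pair<State, std::future<State>>> workers;
    for (int w = 1; w <= 7; ++w) {
        State start = predict(trunk, w * chunk);
        workers.emplace_back(start,
            std::async(std::launch::async, run, start, chunk));
    }
    for (auto& [start, fut] : workers) cache[start] = fut.get();

    // Trunk execution: fast-forward whenever the current state matches a
    // cached speculation; otherwise take one ordinary step.
    std::uint64_t done = 0;
    while (done < total) {
        auto hit = cache.find(trunk);
        if (hit != cache.end()) { trunk = hit->second; done += chunk; }
        else                    { trunk = step(trunk); done += 1;     }
    }
    std::cout << "final acc = " << trunk.acc << "\n";
}
```

In this fully predictable toy kernel every speculation hits, so the trunk skips whole chunks of work after its first chunk of ordinary steps; a misprediction only wastes the corresponding worker's chunk, which is consistent with the abstract's observation that unpredictable programs gain nothing from ASC-like techniques.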
