HELIX: automatic parallelization of irregular programs for chip multiprocessing

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000.
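As a rough illustration of the idea of assigning successive iterations to separate threads, the following is a minimal sketch and not the HELIX implementation. It assumes a loop whose body splits into an independent part and a short sequential segment that carries a dependence on `acc`. All names (`parallel_part`, `run_sequential`, `run_helix_style`) and the per-iteration `threading.Event` signals are hypothetical choices made for this example. The helper threads and the loop-selection model mentioned in the abstract are not modeled.

```python
import threading


def parallel_part(i):
    # Independent work: it has no loop-carried dependence.
    return (i * i + 1) % 97


def run_sequential(n):
    # Reference version: a plain loop with a loop-carried dependence on acc.
    acc = 0
    for i in range(n):
        x = parallel_part(i)
        acc = (acc * 31 + x) % 1_000_003
    return acc


def run_helix_style(n, num_threads):
    state = {"acc": 0}
    # done[i] is set once iteration i has left its sequential segment.
    done = [threading.Event() for _ in range(n)]

    def worker(tid):
        # Cyclic assignment: thread tid runs iterations tid, tid + T, tid + 2T, ...
        for i in range(tid, n, num_threads):
            x = parallel_part(i)      # may overlap with other iterations
            if i > 0:
                done[i - 1].wait()    # wait for the predecessor's signal
            # Sequential segment: executed in iteration order.
            state["acc"] = (state["acc"] * 31 + x) % 1_000_003
            done[i].set()             # signal the successor iteration

    threads = [
        threading.Thread(target=worker, args=(t,)) for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["acc"]


if __name__ == "__main__":
    print(run_helix_style(50, 4) == run_sequential(50))
```

The sketch only shows the ordering discipline: each iteration waits for its predecessor before touching the shared value, so the result matches the sequential loop. In CPython the global interpreter lock prevents any real speedup here; the wait and signal steps stand in for the inter-thread communication cost that the abstract says must be mitigated.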
