Compiler optimization of scalar value communication between speculative threads

While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations that fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS: stalls caused by forwarding scalar values between threads, which is done for scalars that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2-28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.
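To make the notion of a critical forwarding path concrete, the following is a toy timing model, not the dataflow algorithms evaluated in the paper. It assumes every speculative thread (epoch) starts at time 0 on its own processor, waits for a forwarded scalar at a fixed offset, and signals the value to its successor at a later offset. The function name, parameters, and numbers are all illustrative assumptions.

```python
def finish_time(n_epochs, length, wait_at, signal_at, forward_latency=0):
    """Completion time of n_epochs speculative threads in a toy model.

    Assumptions (hypothetical, for illustration only):
      - every epoch starts at time 0 on its own processor;
      - each epoch runs for `length` cycles when it never stalls;
      - it waits for the forwarded scalar at offset `wait_at`;
      - it signals the scalar to the next epoch at offset `signal_at`,
        with wait_at <= signal_at <= length.
    """
    arrival = 0   # time at which the forwarded value reaches the current epoch
    finish = 0
    for _ in range(n_epochs):
        reach_wait = wait_at                  # epoch arrives at its wait point
        resume = max(reach_wait, arrival)     # stall until the value is there
        signal_time = resume + (signal_at - wait_at)
        arrival = signal_time + forward_latency
        finish = max(finish, resume + (length - wait_at))
    return finish


# Signal late in the epoch: the wait-to-signal path (80 cycles) serializes epochs.
print(finish_time(4, 100, wait_at=10, signal_at=90))  # 340

# Signal scheduled early: the path shrinks to 10 cycles and epochs overlap.
print(finish_time(4, 100, wait_at=10, signal_at=20))  # 130
```

In this model the distance between the wait and the signal is paid once per epoch along the chain, which suggests why scheduling the signal earlier can reduce stalls even when the amount of work per thread is unchanged.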
