Aggressive compiler optimization and parallelization with thread-level speculation

We present a technique that exploits close collaboration between the compiler and speculative multithreaded hardware to enable aggressive optimization and parallelization of scalar programs. The compiler aggressively optimizes frequently executed code in user programs by predicting an execution path or the values of long-latency instructions. Based on the predicted hot execution path, the compiler forms regions with greatly simplified data- and control-flow graphs and then performs aggressive optimizations on those regions. Thread-level speculation (TLS) helps expose program parallelism and guarantees program correctness when a prediction is incorrect. With this collaboration between the compiler and the speculative multithreaded hardware, program performance can be significantly improved. Preliminary results with simple trace regions show that the gain in dynamic compiler-scheduled cycles reaches 33% for some benchmarks and averages about 10% across all eight SPECint95 benchmarks. For SPECint2000, the gain is up to 23% under the conservative execution model. With a cycle-accurate simulator and the conservative execution model, the overall gain when runtime factors (e.g., cache misses and branch mispredictions) are taken into account is 12% for vortex and 14.7% for m88ksim. The gains can be higher with more sophisticated region formation and region-based optimizations.
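The core safety mechanism described above, executing optimistically on a prediction and squashing the work if the prediction turns out wrong, can be illustrated with a minimal conceptual sketch. This is not the paper's implementation; the function names and structure below are illustrative assumptions, and real TLS hardware performs the validation and rollback in the memory system rather than in software:

```python
# Conceptual sketch of value speculation with rollback (illustrative only,
# not the paper's system): run a computation on a predicted input value,
# then validate against the real value and re-execute on misprediction.

def speculate(compute, predict, resolve):
    """Run `compute` on a predicted input; commit the speculative result
    only if the prediction matches the eventually resolved real value."""
    predicted = predict()            # cheap guess (e.g., last observed value)
    spec_result = compute(predicted) # speculative work overlaps the latency
    actual = resolve()               # the long-latency value finally arrives
    if actual == predicted:
        return spec_result, True     # prediction correct: commit
    return compute(actual), False    # misspeculation: squash and re-execute

# Usage: a hypothetical last-value predictor for a slow load.
last_value = 42
result, hit = speculate(
    compute=lambda x: x * 2 + 1,
    predict=lambda: last_value,
    resolve=lambda: 42,              # real value matches the prediction here
)
# result == 85, hit == True
```

The compiler's role in the paper is to optimize the common (correctly predicted) path as if the prediction always held, while the speculative hardware provides the squash-and-re-execute path for correctness.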
