Paper information - On the evaluation and extraction of thread-level parallelism in ordinary programs

On the evaluation and extraction of thread-level parallelism in ordinary programs

The need for high performance, coupled with the increasing design complexity of modern processors and with power and thermal constraints, has led to the development of multi-core systems. Examples of such systems include IBM/Toshiba's Cell processor and Intel's Core 2 Duo processor. One way to exploit the hardware parallelism of such systems is thread-level program parallelization. Although a large amount of work has been done on multithreading, the lack of detailed application characterization on real machines makes it difficult to assess the relevance and importance of the problems addressed in prior work, and the practicality of the solutions proposed. To alleviate this limitation, we did a thorough analysis of ordinary programs, as represented by the industry-standard SPEC benchmarks, on both IA-32 and IA-64 architectures to identify real performance bottlenecks. Based on this analysis, and given that loops account for a large percentage of the total execution time in ordinary programs, we propose techniques for extracting thread-level parallelism (TLP) from both DOALL and non-DOALL loops. Extracting TLP from a DOALL loop entails partitioning the loop and mapping it onto the processors efficiently, so as to balance the load between them. In this regard, we present a general approach for partitioning nested DOALL loops, both perfect and non-perfect, with rectangular and non-rectangular iteration geometries, and with conditionals whose expressions are affine functions of the outer loop indices. Non-DOALL loops can be parallelized either speculatively, via thread-level speculation (TLS), or via explicit synchronization. Although TLS enables parallel execution of program regions that are difficult to analyze at compile time, its efficacy is limited by a wide variety of factors, such as a high misspeculation penalty and the need for additional hardware. This necessitates an evaluation of the performance potential of TLS.
Using the Intel Fortran/C++ compiler, we show that the speedup achievable via TLS, at the loop level, is minimal in ordinary programs. Therefore, we adopted explicit synchronization as the way to parallelize non-DOALL loops and proposed lightweight lock-free synchronization techniques for extracting TLP from them. We show that the proposed techniques achieve better performance than the state of the art on real machines.
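
The load-balancing concern for non-rectangular DOALL loops can be illustrated with a minimal sketch. It is not the partitioning algorithm of the thesis: it assumes the simplest triangular nest, in which outer iteration `i` runs `i + 1` inner iterations, and it cuts the outer index range into contiguous chunks of roughly equal total work using prefix sums. The names `balanced_bounds` and `chunk_work` are hypothetical, chosen for this illustration only.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_bounds(n, p):
    """Split outer indices 0..n-1 of a triangular loop nest into p
    contiguous chunks with roughly equal inner-iteration counts.

    Assumption for this sketch: outer iteration i executes i + 1
    inner iterations (for j in range(i + 1)).
    """
    # prefix[i] = inner iterations executed by outer indices 0..i-1
    prefix = [0] + list(accumulate(i + 1 for i in range(n)))
    total = prefix[-1]
    # The k-th cut is the first outer index whose prefix work
    # reaches k/p of the total work.
    cuts = [bisect_left(prefix, total * k // p) for k in range(p + 1)]
    return list(zip(cuts[:-1], cuts[1:]))


def chunk_work(lo, hi):
    """Inner iterations executed by outer indices lo..hi-1."""
    return sum(i + 1 for i in range(lo, hi))


if __name__ == "__main__":
    n, p = 1000, 4
    for lo, hi in balanced_bounds(n, p):
        print(f"outer [{lo}, {hi}): work = {chunk_work(lo, hi)}")
    # For comparison: equal-width chunks of the outer loop
    step = n // p
    for lo in range(0, n, step):
        print(f"naive [{lo}, {lo + step}): work = {chunk_work(lo, lo + step)}")
```

With equal-width chunks, the last processor receives far more work than the first because the later rows of the triangle are longer. The work-based cuts give later chunks fewer outer iterations to compensate.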

Alexandru Nicolau | Arun Kejariwal