Dynamic trace-based analysis of vectorization potential of applications

Recent hardware trends, with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs, imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. The vast majority of existing applications were developed with little attention to the vectorizability of their code. While the developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in automatic vectorization, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess their latent potential for SIMD parallelism, exploitable through further compiler advances and/or manual code changes. In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the data dependences actually observed at run time, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers and in pointing to code regions that can be manually modified to enable auto-vectorization and performance improvement by existing compilers.
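The core idea of the approach can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual tool: it assumes a simplified trace format in which each loop iteration records the addresses it reads and writes, and it reports an iteration-level signal of SIMD potential by checking for loop-carried flow dependences. All names and the trace encoding are hypothetical.

```python
# Illustrative sketch of trace-based dependence analysis (hypothetical
# trace format; the paper's tool works on a full dynamic dependence graph).

def loop_carried_deps(trace):
    """trace: list of (iteration, reads, writes) in execution order,
    where reads/writes are lists of abstract memory addresses.
    Returns the set of (src_iter, dst_iter) loop-carried flow dependences."""
    last_writer = {}  # address -> iteration that last wrote it
    deps = set()
    for it, reads, writes in trace:
        for addr in reads:
            w = last_writer.get(addr)
            if w is not None and w != it:
                deps.add((w, it))  # value flows across iterations
        for addr in writes:
            last_writer[addr] = it
    return deps

def simd_potential(trace):
    """With no loop-carried flow dependences, any dependence-preserving
    reordering is legal, so all iterations could run in SIMD lockstep."""
    return len(loop_carried_deps(trace)) == 0

# Independent iterations, e.g. c[i] = a[i] + b[i]:
vec_trace = [(i, [("a", i), ("b", i)], [("c", i)]) for i in range(4)]
# A reduction, e.g. s += a[i], carries a dependence through s:
red_trace = [(i, [("s", 0), ("a", i)], [("s", 0)]) for i in range(4)]
```

On these inputs, `simd_potential(vec_trace)` is true while `simd_potential(red_trace)` is false, mirroring the paper's point that the observed run-time dependences, rather than conservative static analysis, determine vectorizability.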
