FlexVec: auto-vectorization for irregular loops

Traditional vectorization techniques build a dependence graph with distance and direction information to determine whether a loop is vectorizable. Because vectorization reorders the execution of instructions across iterations, instructions involved in a strongly connected component (SCC) are generally deemed not vectorizable unless the SCC can be eliminated using techniques such as scalar expansion or privatization. Traditional techniques are therefore limited in their ability to efficiently handle loops with dynamic cross-iteration dependencies or complex control flow interwoven within the dependence cycles. When potential dependencies rarely occur at runtime, the end result is underutilization of the SIMD hardware. In this paper, we propose the FlexVec architecture, which combines new vector instructions with novel code generation techniques to dynamically adjust the vector length for loop statements affected by cross-iteration dependencies that occur at runtime. We have designed and implemented FlexVec's new ISA as extensions to the recently released AVX-512 ISA. We have evaluated the performance improvements enabled by FlexVec vectorization for 11 C/C++ SPEC 2006 benchmarks and 7 real applications, with AVX-512 vectorization as the baseline. We show that the FlexVec vectorization technique produces a geomean speedup of 9% for SPEC 2006 and a geomean speedup of 11% for the 7 real applications.
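
To make the idea of adjusting the vector length at runtime concrete, the following is a minimal scalar sketch in C; it is not FlexVec's ISA or its code generator. It assumes a hypothetical loop `out[dst[i]] = out[src[i]] + 1`, whose cross-iteration dependencies are known only at runtime, and an illustrative vector length `VL = 8`. The names `update_scalar`, `safe_lanes`, and `update_partitioned` are invented for this illustration. Each chunk of `VL` iterations is cut at the first lane that reads a location written by an earlier lane of the same chunk, the lanes before the cut are executed together (all loads, then all stores), and execution resumes at the cut.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 8 }; /* illustrative vector length (assumption) */

/* Scalar reference loop with a possible runtime read-after-write dependence. */
void update_scalar(int *out, const int *src, const int *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[dst[i]] = out[src[i]] + 1;
}

/* Number of leading lanes that can run together: stop at the first lane j
 * that reads a location written by an earlier lane i < j of the same chunk. */
static size_t safe_lanes(const int *src, const int *dst, size_t len) {
    for (size_t j = 1; j < len; j++)
        for (size_t i = 0; i < j; i++)
            if (src[j] == dst[i])
                return j;
    return len;
}

/* Emulates partial vector execution: each partition does all of its loads
 * first (like a gather), then all of its stores in lane order (like a
 * scatter), and the partition length shrinks when a dependence is found. */
void update_partitioned(int *out, const int *src, const int *dst, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t len = (n - i < VL) ? n - i : VL;
        size_t k = safe_lanes(src + i, dst + i, len);
        assert(k >= 1 && k <= len); /* always makes forward progress */

        int tmp[VL];
        for (size_t l = 0; l < k; l++)   /* "gather" and add */
            tmp[l] = out[src[i + l]] + 1;
        for (size_t l = 0; l < k; l++)   /* "scatter" in lane order */
            out[dst[i + l]] = tmp[l];

        i += k; /* resume at the first dependent lane */
    }
}
```

When no lane depends on an earlier one, every partition is a full `VL` lanes wide and the loop behaves like an ordinary vectorized loop. When dependencies do occur, only the affected partitions are shortened, which is the behavior the abstract attributes to dynamically adjusting the vector length. Write-after-read and write-after-write overlaps need no cut in this sketch, because loads precede stores and stores are applied in lane order.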
