An Evaluation of Vectorizing Compilers

Most of today's processors include vector units designed to speed up single-threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or with intrinsics in high-level languages is a time-consuming and error-prone task. The alternative is to automate vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0), and XLC (version 11.01). Our results show that, despite all the work done on vectorization over the last 40 years, the compilers we evaluated vectorize only 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications.
