Exploiting superword level parallelism with multimedia instruction sets

Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.
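
To make the idea of SIMD-style parallelism within a basic block concrete, the following is a minimal sketch in portable C. It is not the paper's algorithm or its output. The type `superword4` and the helpers `sw_load`, `sw_store`, and `sw_mul_add` are hypothetical names introduced here for illustration; they emulate a four-lane SIMD register with ordinary loops, where a real target would use a single multimedia instruction.

```c
#include <stddef.h>

/* Hypothetical 4-lane "superword" (an assumption for illustration);
 * a real target would hold this in one SIMD register. */
typedef struct { float lane[4]; } superword4;

/* Emulates a packed load of four adjacent floats. */
static superword4 sw_load(const float *p) {
    superword4 v;
    for (size_t i = 0; i < 4; ++i) v.lane[i] = p[i];
    return v;
}

/* Emulates a packed store of four adjacent floats. */
static void sw_store(float *p, superword4 v) {
    for (size_t i = 0; i < 4; ++i) p[i] = v.lane[i];
}

/* Emulates a lane-wise multiply followed by a lane-wise add. */
static superword4 sw_mul_add(superword4 a, superword4 b, superword4 c) {
    superword4 r;
    for (size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* A basic block with no loop: four independent statements that apply
 * the same operations to adjacent memory locations. */
void block_scalar(float *out, const float *a, const float *b, const float *c) {
    out[0] = a[0] * b[0] + c[0];
    out[1] = a[1] * b[1] + c[1];
    out[2] = a[2] * b[2] + c[2];
    out[3] = a[3] * b[3] + c[3];
}

/* The same block with the four statements packed into superword
 * operations: three packed loads, one packed compute, one packed store. */
void block_packed(float *out, const float *a, const float *b, const float *c) {
    sw_store(out, sw_mul_add(sw_load(a), sw_load(b), sw_load(c)));
}
```

Both functions compute the same four results. The scalar version contains no loop for a loop-based vectorizer to analyze, which is the kind of parallelism within a basic block that the abstract refers to.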
