Automatic Compilation of Loops to Exploit Operator Parallelism on Configurable Arithmetic Logic Units

Configurable arithmetic logic units (ALUs) allow the underlying hardware to be adapted to the varying amount of parallelism in a computation. A configuration is a given hardware implementation of a set of operators together with their multiplicities. Identifying the optimal parallel configuration at each step of a program is a complex problem, but solving it allows the power of these ALUs to be fully exploited. This paper develops an automatic compilation framework for configuration analysis that exploits operator parallelism within loop nests, with the goal of minimizing costly reconfiguration overheads. The framework first applies operator and loop transformations to expose more opportunities for configuration reuse. It then applies a two-pass solution. The first pass analyzes the program dependence graph (PDG) of a loop nest to generate either maximal cutsets (a cutset is a group of statements that execute under a given configuration) or maximally parallel configurations. The second pass analyzes the trade-offs between the costs and benefits of reconfiguration across cutsets and attempts to eliminate reconfiguration overheads by merging cutsets. The methodology is implemented in the SUIF compilation system and tested on loops extracted from the Perfect benchmarks and the Livermore kernels. Good speedups are obtained, showing the merit of the proposed method. The method also scales well with loop size and with the amount of FPGA space available for configurable logic.
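The two-pass idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the per-operator area table, the greedy grouping rule, the merge criterion, and all names (`AREA`, `build_cutsets`, `merge_cutsets`, `STATEMENTS`) are assumptions made for illustration. Statements are assumed to be listed in dependence order, and each statement is assumed to fit on the device by itself. Pass 1 packs mutually independent statements into a cutset whose configuration holds one operator instance per use. Pass 2 merges adjacent cutsets when a single shared configuration (the per-operator maximum) still fits, which removes a reconfiguration without giving up parallelism; the cost/benefit trade-off described in the abstract is more general than this rule.

```python
from collections import Counter

# Hypothetical area cost of one instance of each operator (arbitrary units).
AREA = {"add": 1, "mul": 4, "div": 6}


def config_area(config):
    """Total area of a configuration given as {operator: multiplicity}."""
    return sum(AREA[op] * count for op, count in config.items())


def build_cutsets(statements, capacity):
    """Pass 1 (greedy sketch): group independent statements into cutsets.

    `statements` is a list of (name, operators, dependences) in dependence
    order. A cutset is closed when the next statement depends on a member
    of it or when the grown configuration would exceed `capacity`.
    """
    cutsets = []
    names, config = [], Counter()
    for name, ops, deps in statements:
        grown = config + Counter(ops)
        if names and (deps & set(names) or config_area(grown) > capacity):
            cutsets.append((names, config))
            names, config = [], Counter()
            grown = Counter(ops)
        names.append(name)
        config = grown
    if names:
        cutsets.append((names, config))
    return cutsets


def merge_cutsets(cutsets, capacity):
    """Pass 2 (simplified): merge neighbours that can share one configuration."""
    if not cutsets:
        return []
    merged = [cutsets[0]]
    for names, config in cutsets[1:]:
        prev_names, prev_config = merged[-1]
        shared = prev_config | config  # per-operator maximum multiplicity
        if config_area(shared) <= capacity:
            merged[-1] = (prev_names + names, shared)
        else:
            merged.append((names, config))
    return merged


# Hypothetical loop body: (statement, operators used, statements it depends on).
STATEMENTS = [
    ("S1", {"mul": 2}, set()),
    ("S2", {"add": 2}, set()),
    ("S3", {"add": 1, "mul": 1}, {"S1"}),
    ("S4", {"add": 3}, {"S3"}),
    ("S5", {"div": 2}, {"S4"}),
]

first_pass = build_cutsets(STATEMENTS, capacity=12)
second_pass = merge_cutsets(first_pass, capacity=12)
print(len(first_pass) - 1, "reconfigurations before merging")  # 3
print(len(second_pass) - 1, "reconfigurations after merging")  # 1
```

In this toy input, the first four statements end up under one configuration of two multipliers and three adders, while `S5` keeps its own configuration because adding two dividers would exceed the assumed capacity of 12 units.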
