Compiler assisted architectural exploration framework for coarse grained reconfigurable arrays

Coarse-Grained Reconfigurable Array (CGRA) architectures have been used extensively to accelerate time-consuming loops. Designing such systems requires a good balance between the capabilities of the architecture and the characteristics of the loops. A reliable design is characterized by an optimized cost-performance trade-off. The main aim of this paper is to present an exploration framework that automates the evaluation of CGRA architectures. Specifically, the framework helps the designer identify CGRA architectures tuned toward a specific application domain. The process is assisted by (1) an optimized retargetable compiler based on modulo scheduling and (2) the Synopsys Design Compiler, which provides implementation metrics such as area and clock frequency. Both operate on the description of a parametric CGRA architecture template that can instantiate a wide variety of such architectures. Many earlier studies suggest that clock frequency influences performance; however, none of them examines the impact of the architecture on both clock frequency and performance. Our work studies, for the first time in a unified way, the area, the clock frequency, the instructions per cycle, and the performance. Hence, architectures with a good compromise between cost and performance can be identified. Another objective of the paper is to present the advances made to the compiler approach used by the exploration framework. Specifically, a new, more effective priority scheme is proposed, and the modulo scheduler has been equipped with backtracking capability. The experiments illustrate the algorithm's efficiency and scalability on a given set of DSP benchmarks. Moreover, architectures optimized with respect to the cost-performance trade-off have been identified through an exploration of 72 CGRA architecture alternatives.
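The point about studying area, clock frequency, and instructions per cycle together can be illustrated with a minimal sketch. It is not the paper's framework. It assumes that throughput is the product of instructions per cycle and clock frequency, and it uses a plain Pareto filter as a stand-in for "good compromise between cost and performance". The `Candidate` class, the `pareto_front` helper, the architecture names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical CGRA instance with made-up metrics."""
    name: str
    area_mm2: float    # cost metric, e.g. as a synthesis tool might report it
    clock_mhz: float   # achievable clock frequency
    ipc: float         # instructions per cycle reached by the compiler

    @property
    def throughput_mips(self) -> float:
        # (instructions / cycle) * (1e6 cycles / second)
        # = millions of instructions per second
        return self.ipc * self.clock_mhz


def pareto_front(cands):
    """Keep candidates that no other candidate beats on both area and throughput."""
    front = []
    for c in cands:
        dominated = any(
            o.area_mm2 <= c.area_mm2
            and o.throughput_mips >= c.throughput_mips
            and (o.area_mm2 < c.area_mm2 or o.throughput_mips > c.throughput_mips)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.area_mm2)


# Illustrative values only: the "dense" variants reach a higher IPC but a
# lower clock frequency, so their overall throughput ends up lower.
candidates = [
    Candidate("4x4_mesh", 1.0, 200.0, 6.0),
    Candidate("4x4_dense", 1.5, 150.0, 7.0),
    Candidate("8x8_mesh", 3.0, 180.0, 12.0),
    Candidate("8x8_dense", 4.0, 120.0, 14.0),
]

front_names = [c.name for c in pareto_front(candidates)]
print(front_names)
```

In this toy data set, ranking by instructions per cycle alone would favor the "dense" variants, while the combined view drops them because their lower clock frequency and larger area outweigh the IPC gain.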
