Enabling Multithreading on CGRAs

Coarse-Grained Reconfigurable Arrays (CGRAs) are programmable fabrics that promise both high performance and high power efficiency. Traditionally, CGRAs were used as accelerators in deeply embedded systems and were typically programmed by hand. As CGRAs come to be used as more general-purpose accelerators, however, they need software tools and capabilities to match. Much work has gone into compiler techniques that make CGRAs easier to program, but there is still no support for multithreading. As an accelerator to a multithreaded processor, a CGRA is currently restricted to accelerating only one kernel of one thread running on the processor at any point in time. Supporting multithreading is difficult because the start and end times of threads are dynamic, whereas CGRAs are statically scheduled. In this paper, we propose a strategy for multithreading on a CGRA. The chief capability we develop is a scheme to quickly transform an existing application mapping that uses the entire CGRA into one that uses only a fraction of it. Our experimental results on kernels from multimedia applications demonstrate that multithreading support can improve the total throughput of a CGRA by over 30%, 75%, and 150% on 4x4, 6x6, and 8x8 CGRAs, respectively, compared to single-threaded methods.
