Tartan: evaluating spatial computation for whole program execution

Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of entire applications on Tartan, a general-purpose architecture which integrates a reconfigurable fabric (RF) with a superscalar core. Our compiler automatically partitions and compiles an application into an instruction stream for the core and a configuration for the RF. We use a detailed simulator to capture both timing and energy numbers for all parts of the system.Our results indicate that a hierarchical RF architecture, designed around a scalable interconnect, is instrumental in harnessing the benefits of spatial computation. The interconnect uses static configuration and routing at the lower levels and a packet-switched, dynamically-routed network at the top level. Tartan is most energyefficient when almost all of the application is mapped to the RF, indicating the need for the RF to support most general-purpose programming constructs. Our initial investigation reveals that such a system can provide, on average, an order of magnitude improvement in energy-delay compared to an aggressive superscalar core on single-threaded workloads.

[1]  Seth Copen Goldstein,et al.  C to Asynchronous Dataflow Circuits: An End-to-End Toolflow , 2004 .

[2]  M. Renaudin,et al.  FPGA architecture for multi-style asynchronous logic [full-adder example] , 2005, Design, Automation and Test in Europe.

[3]  Seth Copen Goldstein,et al.  SOMA: a tool for synthesizing and optimizing memory accesses in ASICs , 2005, 2005 Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'05).

[4]  Scott Mahlke,et al.  Effective compiler support for predicated execution using the hyperblock , 1992, MICRO 1992.

[5]  Keshav Pingali,et al.  From Control Flow to Dataflow , 1991, J. Parallel Distributed Comput..

[6]  A. Smith,et al.  PRISM-II compiler and architecture , 1993, [1993] Proceedings IEEE Workshop on FPGAs for Custom Computing Machines.

[7]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture , 2003, IEEE Micro.

[8]  Larry Carter,et al.  Path Analysis and Renaming for Predicated Instruction Scheduling , 2004, International Journal of Parallel Programming.

[9]  Maya Gokhale,et al.  Co-Synthesis to a Hybrid RISC/FPGA Architecture , 2000, J. VLSI Signal Process..

[10]  Rob Payne,et al.  Self-Timed FPGA Systems , 1995, FPL.

[11]  Seth Copen Goldstein,et al.  Spatial computation , 2004, ASPLOS XI.

[12]  George Varghese,et al.  HSRA: high-speed, hierarchical synchronous reconfigurable array , 1999, FPGA '99.

[13]  A. Tsai,et al.  PipeRench: A virtualized programmable datapath in 0.18 micron technology , 2002, Proceedings of the IEEE 2002 Custom Integrated Circuits Conference (Cat. No.02CH37285).

[14]  William J. Dally,et al.  A bandwidth-efficient architecture for media processing , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[15]  James Kao,et al.  Subthreshold leakage modeling and reduction techniques , 2002, ICCAD 2002.

[16]  Bruce Jay Nelson Remote procedure call , 1981 .

[17]  Henry Hoffmann,et al.  Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[18]  William J. Dally,et al.  Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[19]  Carl Ebeling,et al.  An FPGA for implementing asynchronous circuits , 1994, IEEE Design & Test of Computers.

[20]  Michael D. Smith,et al.  A high-performance microarchitecture with hardware-programmable functional units , 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.

[21]  Li Shang,et al.  Dynamic power consumption in Virtex™-II FPGA family , 2002, FPGA '02.

[22]  Li Shang,et al.  High-level power modeling of CPLDs and FPGAs , 2001, Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001.

[23]  Seth Copen Goldstein,et al.  Dataflow: A Complement to Superscalar , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[24]  Miodrag Potkonjak,et al.  MediaBench: a tool for evaluating and synthesizing multimedia and communications systems , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[25]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[26]  Steven M. Nowick,et al.  Robust interfaces for mixed-timing systems , 2004, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[27]  Gregory S. Snider,et al.  A Defect-Tolerant Computer Architecture: Opportunities for Nanotechnology , 1998 .

[28]  Seth Copen Goldstein,et al.  The impact of the nanoscale on computing systems , 2005, ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005..

[29]  Seth Copen Goldstein,et al.  Peer-to-peer hardware-software interfaces for reconfigurable fabrics , 2002, Proceedings. 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[30]  Laurent Fesquet,et al.  FPGA Architecture for Multi-Style Asynchronous Logic , 2007 .

[31]  Mark N. Wegman,et al.  Efficiently computing static single assignment form and the control dependence graph , 1991, TOPL.

[32]  Jens Sparsø,et al.  Scheduling discipline for latency and bandwidth guarantees in asynchronous network-on-chip , 2005, 11th IEEE International Symposium on Asynchronous Circuits and Systems.

[33]  Peter Thomas,et al.  An architecture for asynchronous FPGAs , 2003, Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798).

[34]  Andreas Moshovos,et al.  CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit , 2000, ISCA '00.

[35]  Seth Copen Goldstein,et al.  Compiling Application-Specific Hardware , 2002, FPL.

[36]  Seth Copen Goldstein,et al.  Pegasus: An Efficient Intermediate Representation , 2002 .

[37]  Seth Copen Goldstein,et al.  PipeRench: a co/processor for streaming multimedia acceleration , 1999, ISCA.

[38]  Jonathan Rose,et al.  Measuring the Gap Between FPGAs and ASICs , 2007, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[39]  John Teifel,et al.  An asynchronous dataflow FPGA architecture , 2004, IEEE Transactions on Computers.

[40]  John Wawrzynek,et al.  Adapting software pipelining for reconfigurable computing , 2000, CASES '00.

[41]  Seth Copen Goldstein,et al.  HLS Support for Unconstrained Memory Accesses , 2005 .