Reconciling specialization and flexibility through compound circuits

While parallelism and multi-cores are receiving much attention as a major scalability path, customization is another, orthogonal and complementary, scalability path which can target not easily parallelizable programs or program sections. The key assets of customization are cost and power efficiency. The key limitation of customization is flexibility. However, we argue that there is no perfect balance between efficiency and flexibility, each system vendor may want to strike a different such balance. In this article, we present a method for achieving any desired balance between flexibility and efficiency by automatically combining any set of individual customization circuits into a larger compound circuit. This circuit is significantly more cost efficient than the simple union of all target circuits, and is configurable to behave as any of the target circuits, while avoiding the routing and configuration cost overhead of FPGAs. The more individual circuits are included, the larger the number of applications which can potentially benefit from this compound customization circuit, realizing flexibility at a minimal cost. Moreover, we observe that the compound circuit cost does not increase in proportion to the number of target applications, due to the wide range of common data-flow and control-flow patterns in programs. Currently, the target individual circuits correspond to loops, like most accelerators in embedded systems, but the aggregation method can accommodate circuits of any size. Using the UTDSP benchmarks and accelerators coupled with an embedded PowerPC405 processor, we show that this approach can yield an average performance improvement of 2.97, while the corresponding synthesized aggregate accelerator is 3 time smaller than the sum of individual accelerators for each target benchmark.

[1]  Li Shang,et al.  High-level power modeling of CPLDs and FPGAs , 2001, Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001.

[2]  Paolo Ienne,et al.  Progressive Decomposition: A Heuristic to Structure Arithmetic Circuits , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[3]  Seth Copen Goldstein,et al.  Spatial computation , 2004, ASPLOS XI.

[4]  Cid C. de Souza,et al.  Efficient datapath merging for partially reconfigurable architectures , 2005, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[5]  Nigel P. Topham,et al.  Resource Sharing in Custom Instruction Set Extensions , 2008, 2008 Symposium on Application Specific Processors.

[6]  Olivier Temam,et al.  From sequences of dependent instructions to functions: an approach for improving performance without ILP or speculation , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[7]  Michael A. Schuette,et al.  The Reconfigurable Streaming Vector Processor (RSVPTM) , 2003, MICRO.

[8]  Mark Stephenson,et al.  Bidwidth analysis with application to silicon compilation , 2000, PLDI '00.

[9]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[10]  Majid Sarrafzadeh,et al.  Area-efficient instruction set synthesis for reconfigurable system-on-chip designs , 2004, Proceedings. 41st Design Automation Conference, 2004..

[11]  Trevor Mudge,et al.  Application-specific architecture framework for high-performance low -power embedded computing , 2006 .

[12]  Hyunseok Lee,et al.  SODA: A Low-power Architecture For Software Radio , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[13]  Colin R. Reeves,et al.  Evolutionary computation: a unified approach , 2007, Genetic Programming and Evolvable Machines.

[14]  Scott A. Mahlke,et al.  PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators , 2002, J. VLSI Signal Process..

[15]  Seth Copen Goldstein,et al.  Tartan: evaluating spatial computation for whole program execution , 2006, ASPLOS XII.

[16]  Scott A. Mahlke,et al.  High-level synthesis of nonprogrammable hardware accelerators , 2000, Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors.

[17]  Scott A. Mahlke,et al.  VEAL: Virtualized Execution Accelerator for Loops , 2008, 2008 International Symposium on Computer Architecture.

[18]  Al Davis,et al.  A loop accelerator for low power embedded VLIW processors , 2004, CODES+ISSS '04.

[19]  Paolo Bonzini,et al.  Design and Architectural Exploration of Expression-Grained Reconfigurable Arrays , 2008, 2008 Symposium on Application Specific Processors.

[20]  Scott A. Mahlke,et al.  Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[21]  Olivier Temam,et al.  UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development , 2007, IEEE Computer Architecture Letters.

[22]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003, ISCA '03.