Paper information - PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators

The PICO-NPA system automatically synthesizes nonprogrammable accelerators (NPAs) to be used as co-processors for functions expressed as loop nests in C. The NPAs it generates consist of a synchronous array of one or more customized processor datapaths, their controller, local memory, and interfaces. The user, or a design space exploration tool that is part of the full PICO system, identifies within the application a loop nest to be implemented as an NPA, and indicates the performance required of the NPA by specifying the number of processors and the number of machine cycles that each processor uses per iteration of the inner loop. PICO-NPA emits synthesizable HDL that defines the accelerator at the register transfer level (RTL). The system also modifies the user's application software to make use of the generated accelerator.

The main objective of PICO-NPA is to reduce design cost and time without significantly reducing design quality. Design of an NPA and its support software typically requires one or two weeks using PICO-NPA, which is a many-fold improvement over the industry norm. In addition, PICO-NPA can readily generate a wide range of implementations with scalable performance from a single specification.

In an experimental comparison of NPAs of equivalent throughput, PICO-NPA designs are slightly more costly than hand-designed accelerators. Logic synthesis and place-and-route have been performed successfully on PICO-NPA designs, which have achieved high clock rates.
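To make the abstract's inputs concrete, the following is a minimal sketch in C of the kind of loop nest a user might mark for acceleration, together with a back-of-envelope cycle estimate derived from the two performance parameters the abstract names (number of processors and machine cycles per inner-loop iteration). The FIR kernel, the names `fir` and `estimate_cycles`, and the sizes `N_OUT` and `N_TAPS` are illustrative assumptions, not taken from the paper. The estimate also assumes iterations divide evenly across processors and ignores pipeline fill, drain, and memory stalls, so it is not PICO-NPA's own performance model.

```c
#include <assert.h>
#include <stddef.h>

#define N_OUT  64  /* number of output samples (hypothetical size) */
#define N_TAPS 8   /* number of filter taps (hypothetical size) */

/* Hypothetical candidate kernel: a two-deep loop nest computing
 * y[i] = sum over j of c[j] * x[i + j].
 * x must hold N_OUT + N_TAPS - 1 samples. */
void fir(const int x[N_OUT + N_TAPS - 1], const int c[N_TAPS], int y[N_OUT])
{
    for (size_t i = 0; i < N_OUT; i++) {
        y[i] = 0;
        for (size_t j = 0; j < N_TAPS; j++) {
            y[i] += c[j] * x[i + j];
        }
    }
}

/* Rough cycle estimate (an assumption, not the paper's model):
 * each processor executes ceil(iterations / processors) inner-loop
 * iterations, and each iteration costs cycles_per_iteration cycles. */
unsigned long estimate_cycles(unsigned long iterations,
                              unsigned long processors,
                              unsigned long cycles_per_iteration)
{
    assert(processors > 0 && cycles_per_iteration > 0);
    unsigned long per_processor = (iterations + processors - 1) / processors;
    return per_processor * cycles_per_iteration;
}
```

Under these assumptions, the kernel above has 64 × 8 = 512 inner-loop iterations, so `estimate_cycles(512, 4, 2)` gives 256 cycles for 4 processors at 2 cycles per iteration. Varying the two parameters is one way to picture the range of implementations with scalable performance that the abstract describes.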
