Impact of Loop Tiling on the Controller Logic of Acceleration Engines

The high computational effort of modern signal and image processing applications often demands special-purpose accelerators in a system on chip (SoC). New high-level synthesis methodologies enable the automated design of such programmable or non-programmable accelerators. Loop tiling is a transformation widely used in these methodologies for dimensioning accelerators so that the inherent massive parallelism of the considered algorithms matches the available functional units and processor elements. Innately, the applications are data-flow dominant and contain almost no control flow, but applying tiling techniques has the disadvantage of a more complex control and communication flow. In this paper, we present a methodology for the automatic generation of the control engines of such accelerators; the controller orchestrates both the data transfers and the computation. The effect of tiling on the area, latency, and power overhead of the controller is studied in detail. It is shown that the controller incurs a substantial overhead of up to 50% for different tiling and throughput parameters. The energy-delay product (energy multiplied by execution time) is also used as a metric for identifying optimal accelerator designs.
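As a minimal illustration of the transformation discussed above, the C sketch below tiles a simple FIR-like loop nest. The problem sizes N and K and the tile size T are illustrative assumptions, not parameters taken from the paper; the outer inter-tile loop stands for the sequencing that a generated controller would perform (transferring one tile's data and triggering its computation), while the inner intra-tile loops represent the work mapped onto the accelerator's processor elements.

```c
/* Hedged sketch of loop tiling on a 1-D FIR-like loop nest.
 * N, K, and T are assumed values chosen only for illustration. */
#include <stdio.h>

#define N 1024   /* number of output samples (assumed)            */
#define K 16     /* number of filter taps (assumed)               */
#define T 64     /* tile size, e.g. matched to PE/buffer capacity */

static float x[N + K];   /* input samples  */
static float h[K];       /* filter taps    */
static float y[N];       /* output samples */

int main(void)
{
    /* Untiled reference loop:
     *   for (i = 0; i < N; i++)
     *     for (k = 0; k < K; k++)
     *       y[i] += h[k] * x[i + k];
     *
     * Tiled version: the inter-tile loop corresponds to the control
     * and communication flow handled by the controller; the intra-tile
     * loops correspond to the data-flow-dominant computation. */
    for (int ii = 0; ii < N; ii += T) {                 /* inter-tile loop */
        for (int i = ii; i < ii + T && i < N; i++) {    /* intra-tile loop */
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += h[k] * x[i + k];
            y[i] = acc;
        }
    }
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```

The tile size T is the knob that trades intra-tile parallelism (functional units and processor elements kept busy) against the number of inter-tile iterations the controller must sequence, which is exactly where the controller area, latency, and power overhead studied in the paper arises.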
