Multiloop Parallelisation Using Unrolling and Fission

A technique for parallelising multiple loops in a heterogeneous computing system is presented. Loops are first unrolled and then broken up into multiple tasks, which are mapped to reconfigurable hardware. A performance-driven optimisation is applied to find the best unrolling factor for each loop under hardware size constraints. The approach is demonstrated using three applications: speech recognition, image processing, and the N-Body problem. Experimental results show that, for the N-Body problem, a maximum speedup of 34 times is achieved on a 274 MHz FPGA over a 2.6 GHz microprocessor, which is 4.1 times higher than the speedup of an approach without unrolling.
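The two steps described above (unrolling followed by fission into tasks, then a search for per-loop unrolling factors under a size constraint) can be sketched as follows. This is only an illustration under stated assumptions, not the optimisation used in the paper: the function names, the exhaustive search, and the cost model (time as the sum of per-loop iteration counts divided by the unrolling factor, area as unrolling factor times a fixed per-copy area) are all hypothetical.

```python
from itertools import product
from math import ceil


def split_into_tasks(n_iterations, factor):
    """Unroll a loop by `factor`, then apply fission: task k receives
    iterations k, k + factor, k + 2*factor, ... so that each task could
    be mapped to its own hardware copy of the loop body."""
    return [list(range(k, n_iterations, factor)) for k in range(factor)]


def best_unroll_factors(loops, area_budget, max_factor=8):
    """Exhaustively search unrolling factors for several loops.

    loops: list of (iterations, area_per_copy) pairs (assumed cost model).
    Returns (estimated_time, factors), or None if nothing fits the budget.
    """
    best = None
    for factors in product(range(1, max_factor + 1), repeat=len(loops)):
        # Assumed area model: each unrolled copy costs a fixed area.
        area = sum(u * a for u, (_, a) in zip(factors, loops))
        if area > area_budget:
            continue
        # Assumed time model: loops run one after another, and each
        # loop's time shrinks with its unrolling factor.
        time = sum(ceil(n / u) for u, (n, _) in zip(factors, loops))
        if best is None or time < best[0]:
            best = (time, factors)
    return best


if __name__ == "__main__":
    print(split_into_tasks(7, 3))
    print(best_unroll_factors([(100, 10), (40, 20)], area_budget=80))
```

In this toy model the search trades area between the loops: giving more copies to one loop leaves fewer for the other, so the best factors depend on the whole set of loops rather than on each loop in isolation.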
