Paper information - Multiloop Parallelisation Using Unrolling and Fission

A technique for parallelising multiple loops in a heterogeneous computing system is presented. Loops are first unrolled and then split (loop fission) into multiple tasks, which are mapped to reconfigurable hardware. A performance-driven optimisation finds the best unrolling factor for each loop under hardware size constraints. The approach is demonstrated on three applications: speech recognition, image processing, and the N-Body problem. Experimental results show a maximum speedup of 34 times for the N-Body problem on a 274MHz FPGA over a 2.6GHz microprocessor, which is 4.1 times higher than that of an approach without unrolling.
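The search for per-loop unrolling factors under a hardware size constraint can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function name `best_unroll_factors`, the cost model (an unrolled loop with factor `u` takes `ceil(n / u)` steps and occupies `u` copies of its hardware), the assumption that the loops run concurrently so the slowest one sets the total time, and the exhaustive search itself. The paper's actual optimisation may differ in all of these respects.

```python
from itertools import product
from math import ceil


def best_unroll_factors(loops, area_budget, max_factor=8):
    """Exhaustively pick one unrolling factor per loop (hypothetical model).

    loops: list of (iterations, time_per_iteration, area_per_copy) tuples.
    Returns (makespan, area, factors) for the best feasible choice,
    or None if no combination fits within area_budget.
    """
    best = None
    for factors in product(range(1, max_factor + 1), repeat=len(loops)):
        # Assumed area model: unrolling by u replicates the loop body u times.
        area = sum(u * a for u, (_, _, a) in zip(factors, loops))
        if area > area_budget:
            continue  # violates the hardware size constraint
        # Assumed timing model: loops run in parallel, so the slowest
        # loop determines the overall execution time.
        makespan = max(ceil(n / u) * t for u, (n, t, _) in zip(factors, loops))
        candidate = (makespan, area, factors)
        # Prefer the lower makespan, then the smaller area on ties.
        if best is None or candidate < best:
            best = candidate
    return best


# Two made-up loops: (iterations, time per iteration, area per copy).
example_loops = [(100, 2, 10), (50, 1, 20)]
result = best_unroll_factors(example_loops, area_budget=60, max_factor=4)
print(result)
```

With these made-up numbers the search spends the whole area budget on unrolling the first loop four times, because that loop dominates the execution time. Exhaustive search is only practical for a handful of loops and small factors; it is used here to keep the sketch short.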
