The acceleration of pipeline workloads under the FPGA area and bandwidth constraints

This work is motivated by advances in heterogeneous computing and the strong practical demand for workload acceleration. Considering pipeline workloads on an FPGA, this paper explores a systematic methodology for configuring the hardware instances of each pipeline stage so that the maximum execution time among all stages is minimized, subject to FPGA area allocation and memory bandwidth constraints. An algorithm is proposed for the target problem and proved to be optimal, and a real implementation study is conducted. In the experimental results, an FPGA implementation of an image filter outperforms the CPU, GPU, and baseline FPGA solutions by 460%, 73%, and 1030%, respectively. Extensive simulations with a large FPGA size were also conducted to show the scalability of this work.
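To make the optimization target concrete, the following is a minimal Python sketch under a simplified model that is an assumption made here, not the paper's formulation: stage i with n_i parallel instances takes t_i / n_i per item, and each instance consumes a fixed FPGA area a_i and memory bandwidth b_i. Under that model the optimal bottleneck is always of the form t_i / k for some stage i and instance count k, and feasibility is monotone in the target bottleneck, so a binary search over those candidate values yields an optimal allocation. The function name `allocate_instances` and its parameters are hypothetical, and this is not the algorithm proposed in the paper.

```python
import math
from fractions import Fraction


def allocate_instances(times, areas, bandwidths, area_budget, bw_budget):
    """Return (instances, bottleneck) minimizing max_i times[i] / instances[i].

    Hypothetical simplified model (an assumption, not the paper's exact one):
    stage i with n_i instances takes times[i] / n_i per item; each instance
    uses areas[i] FPGA area and bandwidths[i] memory bandwidth. All inputs
    are positive integers. Returns None if one instance per stage cannot fit.
    """

    def needed(target):
        # Fewest instances per stage so every stage finishes within `target`.
        return [math.ceil(Fraction(t) / target) for t in times]

    def feasible(counts):
        area = sum(a * n for a, n in zip(areas, counts))
        bw = sum(b * n for b, n in zip(bandwidths, counts))
        return area <= area_budget and bw <= bw_budget

    if not feasible([1] * len(times)):
        return None

    # The optimal bottleneck equals times[i] / k for some stage i and count k,
    # where k is bounded by what that stage alone could fit in the budgets.
    candidates = set()
    for t, a, b in zip(times, areas, bandwidths):
        max_k = min(area_budget // a, bw_budget // b)
        for k in range(1, max_k + 1):
            candidates.add(Fraction(t, k))
    ordered = sorted(candidates)

    # A larger target never needs more instances, so feasibility is monotone
    # and binary search finds the smallest feasible candidate. The largest
    # candidate (max of times) needs one instance per stage, which fits.
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(needed(ordered[mid])):
            hi = mid
        else:
            lo = mid + 1

    counts = needed(ordered[lo])
    bottleneck = max(Fraction(t, n) for t, n in zip(times, counts))
    return counts, bottleneck
```

For example, with hypothetical stage times of 12, 6, and 4 ms, instance areas of 3, 2, and 1 units, one bandwidth unit per instance, an area budget of 16, and a bandwidth budget of 10, `allocate_instances([12, 6, 4], [3, 2, 1], [1, 1, 1], 16, 10)` returns `([3, 2, 1], Fraction(4, 1))`. The sketch returns the fewest instances that meet the optimal bottleneck, so some area or bandwidth may remain unused.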
