Acceleration of control flow on CGRA using advanced predicated execution

The coarse-grained reconfigurable array (CGRA) is an attractive architecture in terms of both performance and flexibility. However, because its performance gains come from exploiting parallelism, the architecture is typically poor at handling control flow, which is sequential in nature. Many attempts have been made to overcome this problem with predicated execution techniques, but they either do not support all types of control flow or suffer performance degradation when they do. In addition, predicated execution schemes generally require a longer execution time because both the if-path and the else-path are always executed. This paper proposes advanced predicated execution techniques that can handle and accelerate all types of control flow with only 2% hardware overhead. These techniques can also be easily extended to general SIMD machines. We implemented the techniques on a coarse-grained reconfigurable array architecture and verified their functionality and effectiveness by accelerating an H.264 deblocking filter, a kernel that is both data- and control-intensive. The results show that the proposed approach achieves up to a 43% improvement in execution time compared to speculation, at the cost of 76% in code size, and a 24% improvement in execution time compared to the previous full predication approach, with a smaller code size.
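
As a rough illustration of why plain predicated execution costs extra time, the sketch below contrasts ordinary branching with an if-converted version in which both paths are computed and a predicate selects the result. It is not the technique proposed in the paper; the function names, the operation being performed, and the simple operation count are all assumptions made for this example.

```python
def branchy(xs, threshold):
    """Ordinary control flow: only the taken path runs for each element."""
    out, ops = [], 0
    for x in xs:
        if x > threshold:
            out.append(x - threshold)   # if-path
        else:
            out.append(x + 1)           # else-path
        ops += 1                        # one path executed per element
    return out, ops


def predicated(xs, threshold):
    """If-converted form: both paths run, then a predicate picks the result."""
    out, ops = [], 0
    for x in xs:
        p = x > threshold               # predicate (hypothetical guard bit)
        then_val = x - threshold        # if-path, always executed
        else_val = x + 1                # else-path, always executed
        ops += 2                        # both paths executed per element
        out.append(then_val if p else else_val)  # select guarded by p
    return out, ops
```

Both functions return the same values, but the predicated form performs the work of both paths for every element, which is the overhead the abstract refers to.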
