Wide Single Instruction Multiple Data architectures (wide-SIMDs) have been shown to achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, in which multiple input elements are combined into a single element by an associative operation, e.g. addition or multiplication. Applications that require reduction include partial histogram merging, matrix multiplication, and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which are generally connected by a minimal interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms that do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach offers much better generality and flexibility at no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.
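As a generic illustration of what reduction by an associative operation means, the following minimal Python sketch combines neighbouring elements pairwise, round by round, until a single value remains. It is not either of the two algorithms proposed here, which this abstract does not detail, and it does not model a PE array or its interconnect. The function name `tree_reduce` and its parameters are hypothetical, chosen only for this example.

```python
from operator import add, mul


def tree_reduce(values, op):
    """Reduce `values` to one element with the associative operation `op`.

    Adjacent pairs are combined in each round, so the number of rounds
    grows logarithmically with the input length. Element order is
    preserved, so `op` needs to be associative but not commutative.
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one element")
    while len(level) > 1:
        # Combine elements (0,1), (2,3), ... of the current round.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        # An unpaired last element is carried over unchanged.
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    print(tree_reduce(data, add))  # sum of the elements
    print(tree_reduce(data, mul))  # product of the elements
    print(tree_reduce(data, max))  # max-finding
```

The same routine covers the addition, multiplication, and min/max cases mentioned above simply by swapping the operation passed as `op`.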