MacSim: A MAC-Enabled High-Performance Low-Power SIMD Architecture

Single-Instruction-Multiple-Data (SIMD) architectures, which exploit data-level parallelism (DLP), are widely used to achieve high-performance, low-power computing. In most streaming applications, such as CNN-based detection and recognition, color space conversion, and various kinds of filters, multiply-accumulate is one of the most important and most expensive operations. In this paper, we propose MacSim, a high-performance low-power SIMD architecture with advanced multiply-accumulator (MAC) support that improves computational efficiency. In addition, we propose a smart loop tiling scheme. To support this tiling further, the MAC unit is equipped with multiple accumulator registers. Based on a design space exploration (DSE) of the proposed MAC unit, a MAC instance with four accumulator registers (MAC4reg) is selected as a good choice for the target kernels. A 64-PE (processing element) 16-bit SIMD instance without MAC support is taken as the baseline. For a head-to-head comparison, a 64-PE 16-bit SIMD with MAC4reg (MacSim4) and the baseline SIMD are both implemented in HDL and synthesized with a TSMC 40nm low-power library. Five streaming application kernels are mapped to both architectures. Our experimental results show that, with MAC4reg, runtime and energy consumption are reduced by up to 38% and 42%, respectively. In addition, a 4-layer CNN-based detection application is fully mapped onto the proposed MacSim4. Working at 950MHz, MacSim4 reaches a throughput of 62.4 GOPS, which meets the requirement of real-time (720p HD, 30fps) detection. The energy consumption per PE per operation is 4.7pJ/Op excluding SRAM (Static Random Access Memory) and 4.8pJ/Op including a 2k-entry SRAM bank. As a prototype, the proposed SIMD is mapped onto an FPGA and can run all the kernels.
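
To illustrate why several accumulator registers can help a tiled loop, here is a minimal Python sketch of a 1-D filter kernel. It is an illustration under assumptions, not the tiling scheme of the paper, which the abstract does not detail: the function names, the `n_acc` parameter, and the coefficient-load counter are all hypothetical. The sketch keeps `n_acc` partial sums live at once, so each filter coefficient is fetched once per tile of outputs rather than once per output.

```python
def fir_single_acc(x, h):
    """Baseline sketch: one accumulator, one output at a time."""
    n_out = len(x) - len(h) + 1
    y, coeff_loads = [], 0
    for i in range(n_out):
        acc = 0
        for k, c in enumerate(h):
            coeff_loads += 1          # coefficient fetched for every output
            acc += c * x[i + k]       # multiply-accumulate
        y.append(acc)
    return y, coeff_loads


def fir_tiled(x, h, n_acc=4):
    """Tiled sketch: n_acc accumulators (hypothetical parameter) per tile."""
    n_out = len(x) - len(h) + 1
    y, coeff_loads = [], 0
    for base in range(0, n_out, n_acc):
        width = min(n_acc, n_out - base)   # last tile may be partial
        acc = [0] * width                  # models the accumulator registers
        for k, c in enumerate(h):
            coeff_loads += 1               # coefficient fetched once per tile
            for j in range(width):
                acc[j] += c * x[base + j + k]
        y.extend(acc)
    return y, coeff_loads


x = list(range(10))
h = [1, 2, 3]
y_base, loads_base = fir_single_acc(x, h)
y_tile, loads_tile = fir_tiled(x, h, n_acc=4)
print(y_base == y_tile, loads_base, loads_tile)
```

Both versions produce the same outputs; in this toy model the tiled version performs fewer coefficient fetches because each one is reused across the accumulators of a tile. How this translates into cycles or energy on real hardware depends on details the abstract does not give.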
