论文信息 - Feedforward-Cutset-Free Pipelined Multiply–Accumulate Unit for the Machine Learning Accelerator

Feedforward-Cutset-Free Pipelined Multiply–Accumulate Unit for the Machine Learning Accelerator

Multiply–accumulate (MAC) computations account for a large part of machine learning accelerator operations. The pipelined structure is usually adopted to improve the performance by reducing the length of critical paths. An increase in the number of flip-flops due to pipelining, however, generally results in significant area and power increase. A large number of flip-flops are often required to meet the feedforward-cutset rule. Based on the observation that this rule can be relaxed in machine learning applications, we propose a pipelining method that eliminates some of the flip-flops selectively. The simulation results show that the proposed MAC unit achieved a 20% energy saving and a 20% area reduction compared with the conventional pipelined MAC.

[1] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Tung Thanh Hoang,et al. A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit , 2010, IEEE Transactions on Circuits and Systems I: Regular Papers.

[3] Yoshua Bengio,et al. BinaryConnect: Training Deep Neural Networks with binary weights during propagations , 2015, NIPS.

[4] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[5] Vojin G. Oklobdzija,et al. Implementing multiply-accumulate operation in multiplication time , 1997, Proceedings 13th IEEE Sympsoium on Computer Arithmetic.

[6] Igor Carron,et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks , 2016 .

[7] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.

[8] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.

[9] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[10] Earl E. Swartzlander,et al. A comparison of Dadda and Wallace multiplier delays , 2003, SPIE Optics + Photonics.

[11] Geoffrey E. Hinton,et al. Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12] Keshab K. Parhi,et al. VLSI digital signal processing systems , 1999 .

[13] Marian Verhelst,et al. 14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI , 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).

[14] D. H. Jacobsohn,et al. A Suggestion for a Fast Multiplier , 1964, IEEE Trans. Electron. Comput..