PIT: Processing-In-Transmission With Fine-Grained Data Manipulation Networks

In the domain of data-parallel computation, most works focus on dataflow optimization inside the PE array and on a favorable memory hierarchy to pursue maximum parallelism and efficiency, while the importance of the data contents themselves has long been overlooked. We observe that, for structured data, insight into the contents (i.e., their values and their locations within the structured form) can greatly benefit computation performance, because it enables fine-grained data manipulation. In this paper, we claim that by providing a flexible and adaptive data path, an efficient architecture capable of fine-grained data manipulation can be built. Specifically, we design SOM, a portable and highly adaptive data transmission network capable of operand sorting, non-blocking self-route ordering, and multicasting. Based on SOM, we propose the processing-in-transmission architecture (PITA), which extends the traditional SIMD architecture by embedding multiple levels of SOM networks on the data path, so that fundamental data processing is performed while the data is in transit. We evaluate PITA on two irregular computation problems. We first map matrix inversion onto PITA and show a considerable performance gain: a 3×-20× speedup over Intel MKL and a 20×-40× speedup over cuBLAS. We then evaluate PITA on sparse CNNs. The results indicate that PITA greatly improves computation efficiency and reduces memory bandwidth pressure, achieving a 2×-9× speedup over several state-of-the-art sparse CNN accelerators while maintaining nearly 100 percent PE efficiency under high sparsity. We believe processing-in-transmission (PIT) is a promising computing paradigm that can extend the capability of traditional parallel architectures.
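SOM's operand-sorting capability means values can be ordered by the network fabric itself as they move between memory and the PE array. The sketch below is not the paper's SOM design; it is a minimal software analogy using a classic Batcher bitonic sorting network, the standard way sorting is realized as a fixed wiring of compare-and-exchange elements on a hardware data path. All names in the sketch are illustrative.

```python
# Minimal sketch (not the SOM microarchitecture): a Batcher bitonic sorting
# network. In hardware, each compare-and-exchange below would be a wired
# comparator stage, so an N-operand vector is sorted in O(log^2 N) pipeline
# stages while it traverses the data path.

def bitonic_merge(data, lo, n, ascending):
    """Merge the bitonic sequence data[lo:lo+n] (n a power of two) into order."""
    if n > 1:
        m = n // 2
        for i in range(lo, lo + m):
            # One compare-and-exchange element per operand pair.
            if (data[i] > data[i + m]) == ascending:
                data[i], data[i + m] = data[i + m], data[i]
        bitonic_merge(data, lo, m, ascending)
        bitonic_merge(data, lo + m, m, ascending)

def bitonic_sort(data, lo, n, ascending=True):
    """Sort data[lo:lo+n] in place; n must be a power of two."""
    if n > 1:
        m = n // 2
        bitonic_sort(data, lo, m, True)        # sort first half ascending
        bitonic_sort(data, lo + m, m, False)   # sort second half descending -> bitonic
        bitonic_merge(data, lo, n, ascending)

if __name__ == "__main__":
    operands = [7, 3, 0, 5, 6, 1, 4, 2]  # width must be a power of two
    bitonic_sort(operands, 0, len(operands))
    print(operands)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the comparator schedule of such a network is data-independent, its latency is fixed regardless of input values, which is what makes sorting networks attractive to embed directly on a SIMD data path rather than running a sort on the PEs themselves.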
