TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory

Personalized recommendation systems are gaining significant traction due to their industrial importance. Embedding layers are an important building block of recommendation systems and are highly memory-intensive. Their fundamental primitive is a gather of embedding vectors followed by a vector reduction, which has low arithmetic intensity and is therefore bottlenecked by memory throughput. To tackle this challenge, recent proposals employ near-data processing (NDP) at the DRAM rank level and achieve impressive speedups. We observe that these prior NDP solutions, which rely on rank-level parallelism, leave significant performance potential on the table because they do not fully exploit the abundant transfer throughput inherent in DRAM datapaths. We propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank, bank-group, or bank level. We modify the DRAM interface to deliver commands efficiently to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal DDR5-based TRiM design achieves up to 7.7× and 3.9× speedup and reduces the energy consumption of embedding-vector gather and reduction by 55% and 50% over the baseline and the state-of-the-art NDP architecture, respectively, with an area overhead equivalent to 2.66% of the DRAM chips.
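The gather-and-reduce primitive described above can be illustrated with a minimal sketch. It is not the TRiM implementation: the function names, the toy table, and the `i % num_banks` row-to-bank mapping are assumptions made only for illustration. The sketch contrasts a flat reduction, where every gathered vector reaches the reducer, with a two-level reduction that first sums vectors inside each hypothetical bank and forwards only one partial sum per bank.

```python
from collections import defaultdict


def gather_reduce(table, indices):
    """Flat baseline: fetch every selected vector and sum element-wise."""
    dim = len(table[0])
    out = [0.0] * dim
    for i in indices:
        for d in range(dim):
            out[d] += table[i][d]  # one add per element fetched
    return out


def hierarchical_gather_reduce(table, indices, num_banks):
    """Reduce inside each (hypothetical) bank, then combine the partial sums."""
    dim = len(table[0])
    partial = defaultdict(lambda: [0.0] * dim)
    for i in indices:
        bank = i % num_banks  # assumed mapping, not TRiM's address mapping
        for d in range(dim):
            partial[bank][d] += table[i][d]
    out = [0.0] * dim
    for vec in partial.values():
        for d in range(dim):
            out[d] += vec[d]
    # The second value counts the vectors that leave the banks.
    return out, len(partial)


# Toy embedding table: 8 rows, 4 elements per vector.
table = [[float(r * 10 + c) for c in range(4)] for r in range(8)]
indices = [1, 3, 4, 6]

flat = gather_reduce(table, indices)
tree, vectors_sent = hierarchical_gather_reduce(table, indices, num_banks=2)
print(flat, tree, vectors_sent)
```

In this toy case both paths produce the same sum, but the two-level path forwards two partial vectors instead of four gathered ones, which is the kind of traffic reduction that in-DRAM reduction units aim for.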
