FlashEmbedding: storing embedding tables in SSD for large-scale recommender systems

We present FlashEmbedding, a hardware/software co-design solution that stores embedding tables on SSDs for large-scale recommendation inference on memory-capacity-limited systems. FlashEmbedding combines an embedding-semantics-aware SSD, an embedding-oriented software cache, and pipelining techniques to improve overall performance. We evaluate FlashEmbedding with our FPGA-based prototype SSD on a real-world public dataset. FlashEmbedding achieves up to 17.44× lower embedding-lookup latency and 2.89× lower end-to-end latency than the baseline solution on a memory-capacity-limited system.
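
To make the design concrete, the following is a minimal, hypothetical sketch of the lookup path the abstract implies: an embedding-oriented DRAM cache in front of SSD-resident embedding tables, with cache misses gathered and fetched from the device in one batched read. The embedding dimension, the SsdTable and EmbeddingCache names, and the memmap-backed table are illustrative assumptions rather than interfaces from the paper, and the paper's pipelining of device reads with compute is omitted.

```python
# Hypothetical sketch only: names, sizes, and interfaces are assumptions,
# not FlashEmbedding's actual API.
from collections import OrderedDict
import numpy as np

EMB_DIM = 64  # assumed embedding dimension (float32 rows)

class SsdTable:
    """Stand-in for an SSD-resident embedding table (backed here by a memmap)."""
    def __init__(self, path, num_rows):
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_rows, EMB_DIM))

    def read_rows(self, row_ids):
        # In the real system this would be a batched (possibly in-storage) read.
        return self.data[row_ids]

class EmbeddingCache:
    """LRU cache over embedding rows; misses fall back to the SSD table."""
    def __init__(self, capacity_rows, ssd_table):
        self.capacity = capacity_rows
        self.ssd = ssd_table
        self.rows = OrderedDict()  # row_id -> np.ndarray

    def lookup(self, row_ids):
        out = np.empty((len(row_ids), EMB_DIM), dtype=np.float32)
        miss_pos, miss_ids = [], []
        for i, rid in enumerate(row_ids):
            vec = self.rows.get(rid)
            if vec is None:
                miss_pos.append(i)      # defer misses so they can be batched
                miss_ids.append(rid)
            else:
                self.rows.move_to_end(rid)  # mark as recently used
                out[i] = vec
        if miss_ids:
            fetched = self.ssd.read_rows(miss_ids)  # one batched device read
            for pos, rid, vec in zip(miss_pos, miss_ids, fetched):
                out[pos] = vec
                self.rows[rid] = np.array(vec)
                if len(self.rows) > self.capacity:
                    self.rows.popitem(last=False)   # evict least recently used row
        return out.sum(axis=0)  # sum-pooled embedding for one sparse feature
```

In this sketch the cache absorbs hot rows in DRAM while cold rows are served from the SSD, which is the division of labor the abstract describes; overlapping the batched SSD read with other work is where the pipelining would apply.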
