DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference
Carole-Jean Wu | Gu-Yeon Wei | Hsien-Hsin S. Lee | Vikram Saraph | David Brooks | Brandon Reagen | Udit Gupta | Xiaodong Wang | Samuel Hsia