REDUCT: Keep it Close, Keep it Cool! Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute

Deep Neural Networks (DNNs) are used in a wide variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware, both in the datacenter and at the edge, continues. General-purpose multi-core CPUs offer uniquely attractive advantages for DNN inference in the datacenter [60] and at the edge [71]. Most of the CPU pipeline's design complexity targets general-purpose single-thread performance and is overkill for relatively simple, but still hugely important, data-parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference.

We present REDUCT, in which we build innovative solutions that bypass the traditional CPU resources that raise DNN inference power and limit its performance. Fundamentally, REDUCT's "Keep it close" policy executes consecutive pieces of work close to each other: instruction delivery and decode happen close to execution, and instruction execution happens close to data. Simple ISA extensions encode the fixed-iteration-count, loop-heavy behavior of these workloads, enabling an effective bypass of many power-hungry front-end stages of the wide out-of-order (OoO) CPU pipeline. Per-core performance scales efficiently by distributing lightweight tensor compute near every cache in the multi-level cache hierarchy, which maximizes the cumulative utilization of the system's existing architectural bandwidth resources and minimizes data movement.

Across a range of DNN models, REDUCT achieves a 2.3× increase in convolution performance/Watt with a 2× to 3.94× scaling in raw performance. Similarly, REDUCT achieves a 1.8× increase in inner-product performance/Watt with a 2.8× scaling in performance. This performance/power scaling comes with no increase in cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain-Specific Accelerators (DSAs) for DNN inference, offering fresh design choices in the AI era.
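
To make the instruction-delivery mechanism concrete, below is a minimal C sketch (ours, not the paper's code) of the kind of fixed-iteration-count tensor loop such ISA extensions target. The micro-kernel shape (KC, NR) and the commented markers hypothetical_loop_begin/hypothetical_loop_end are illustrative assumptions only; the point is that both trip counts are compile-time constants, so a REDUCT-style extension can replay the loop body from a small buffer near the execution units instead of re-fetching and re-decoding it on every iteration.

/* Minimal sketch: a GEMM-style micro-kernel with fixed trip counts.
   The commented markers are hypothetical, not the paper's actual ISA. */
#include <stddef.h>

#define KC 64  /* fixed reduction depth */
#define NR 16  /* fixed output-tile width */

void microkernel(const float *a, const float *b, float *c) {
    /* hypothetical_loop_begin(KC * NR); -- mark a fixed-count loop region */
    for (size_t k = 0; k < KC; ++k) {       /* trip count known at compile time */
        for (size_t n = 0; n < NR; ++n) {   /* in practice one SIMD FMA per step */
            c[n] += a[k] * b[k * NR + n];
        }
    }
    /* hypothetical_loop_end(); -- resume normal front-end operation */
}

The second mechanism, near-cache compute, would then distribute instances of such micro-kernels across the cache levels so that each piece of tensor work runs next to the data it consumes.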

[1] Minjia Zhang et al. DeepCPU: Serving RNN-based Deep Learning Models 10x Faster, USENIX Annual Technical Conference, 2018.

[2] Martin D. Schatz et al. Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications, arXiv, 2018.

[3] Wolfgang Lehner et al. NeMeSys - A Showcase of Data Oriented Near Memory Graph Processing, SIGMOD Conference, 2019.

[4] Rachata Ausavarungnirun et al. Processing Data Where It Makes Sense: Enabling In-Memory Computation, Microprocessors and Microsystems, 2019.

[5] Rachata Ausavarungnirun et al. Enabling Practical Processing in and near Memory for Data-Intensive Computing, DAC, 2019.

[6] Nathan Beckmann et al. Livia: Data-Centric Computing Throughout the Memory Hierarchy, ASPLOS, 2020.

[7] Jun Yang et al. DrAcc: A DRAM Based Accelerator for Accurate CNN Inference, 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018.

[8] Kilian Q. Weinberger et al. Densely Connected Convolutional Networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[9] Barukh Ziv et al. Lower Numerical Precision Deep Learning Inference and Training, 2018.

[10] Kiran Kumar Matam et al. GraphSSD: Graph Semantics Aware SSD, 46th Annual International Symposium on Computer Architecture (ISCA), 2019.

[11] Scott A. Mahlke et al. Duality Cache for Data Parallel Acceleration, 46th Annual International Symposium on Computer Architecture (ISCA), 2019.

[12] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[13] Mikko H. Lipasti et al. Revolver: Processor Architecture for Power Efficient Loop Execution, 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.

[14] Xuanzhe Liu et al. A First Look at Deep Learning Apps on Smartphones, WWW, 2019.

[15] Jian Sun et al. Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[16] Jung Ho Ahn et al. The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing, ACM Transactions on Architecture and Code Optimization (TACO), 2013.

[17] Kiyoung Choi et al. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.

[18] David A. Patterson et al. In-Datacenter Performance Analysis of a Tensor Processing Unit, 44th Annual International Symposium on Computer Architecture (ISCA), 2017.

[19] Rachata Ausavarungnirun et al. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, ASPLOS, 2018.

[20] David Blaauw et al. Compute Caches, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.

[21] Matti Siekkinen et al. Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision, MMSys, 2018.

[22] S. Sagar Imambi et al. PyTorch, Programming with TensorFlow, 2021.

[23] Carole-Jean Wu et al. Machine Learning at Facebook: Understanding Inference at the Edge, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019.

[24] Christoforos E. Kozyrakis et al. GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.

[25] Glenn Henry et al. High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs (Industrial Product), 47th Annual International Symposium on Computer Architecture (ISCA), 2020.

[26] Alireza Shafaei et al. FinCACTI: Architectural Analysis and Modeling of Caches with Deeply-Scaled FinFET Devices, IEEE Computer Society Annual Symposium on VLSI, 2014.

[27] Stijn Eyerman et al. An Evaluation of High-Level Mechanistic Core Models, ACM Transactions on Architecture and Code Optimization (TACO), 2014.

[28] Pritish Narayanan et al. Deep Learning with Limited Numerical Precision, ICML, 2015.

[29] Jeremy Kepner et al. Survey and Benchmarking of Machine Learning Accelerators, IEEE High Performance Extreme Computing Conference (HPEC), 2019.

[30] J. Thomas Pawlowski et al. Hybrid Memory Cube (HMC), IEEE Hot Chips 23 Symposium (HCS), 2011.

[31] D. Ernst. Competing in Artificial Intelligence Chips: China's Challenge amid Technology War, 2020.

[32] Cody Coleman et al. MLPerf Inference Benchmark, 47th Annual International Symposium on Computer Architecture (ISCA), 2020.

[33] Bo Chen et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv, 2017.

[34] Rachata Ausavarungnirun et al. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization, 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013.

[35] Lukasz Kaiser et al. Attention Is All You Need, NIPS, 2017.

[36] Yu Wang et al. GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.

[37] Onur Mutlu et al. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology, 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017.

[38] Anshumali Shrivastava et al. SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems, MLSys, 2020.

[39] Andrew Zisserman et al. Convolutional Two-Stream Network Fusion for Video Action Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[40] Alexander Heinecke et al. Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 2018.

[41] David Blaauw et al. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks, 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

[42] Trevor Darrell et al. Caffe: Convolutional Architecture for Fast Feature Embedding, ACM Multimedia, 2014.

[43] Wei Chen et al. SkyLake-SP: A 14nm 28-Core Xeon Processor, IEEE International Solid-State Circuits Conference (ISSCC), 2018.

[44] Wei Wang et al. MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving, USENIX Annual Technical Conference, 2019.

[45] Bevan M. Baas et al. Corrigendum to "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm" [Integr. VLSI J. 58 (2017) 74-81], Integration, the VLSI Journal, 2019.

[46] Onur Mutlu et al. Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation, 34th International Conference on Computer Design (ICCD), 2016.