CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and a flexible interconnect. While DNN accelerators can exploit data reuse and achieve high peak throughput, they also expose a large number of runtime parameters to programmers, who must explicitly manage how computation is scheduled both spatially and temporally. Different scheduling choices can lead to wide variations in performance and efficiency, motivating the need for a fast and efficient search strategy to navigate the vast scheduling space. To address this challenge, we present CoSA, a constrained-optimization-based approach for scheduling DNN accelerators. As opposed to existing approaches that rely either on designers' heuristics or on iterative methods to navigate the search space, CoSA expresses scheduling decisions as a constrained-optimization problem that can be solved deterministically using mathematical optimization techniques. Specifically, CoSA leverages the regularities in DNN operators and hardware to formulate the DNN scheduling space as a mixed-integer programming (MIP) problem with algorithmic and architectural constraints, which can be solved to automatically generate a highly efficient schedule in one shot. We demonstrate that CoSA-generated schedules outperform those from state-of-the-art approaches by a geometric mean of up to 2.5× across a wide range of DNN workloads, while improving time-to-solution by 90×.
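To make the MIP framing concrete, below is a minimal, hypothetical sketch of one such scheduling decision: assigning the prime factors of a loop bound to memory levels under buffer-capacity constraints. This is not CoSA's actual formulation; the level names, capacities, and cost weights are illustrative assumptions, and the open-source PuLP interface with the bundled CBC solver stands in for whatever solver the authors use. Products of tiling factors are linearized by working in log space, a standard trick for keeping such capacity constraints linear.

```python
# Hypothetical toy MIP: tile each prime factor of a loop bound at one
# memory level, subject to capacity, minimizing a linear traffic proxy.
# All names, capacities, and weights are illustrative assumptions.
import math
import pulp

factors = [2, 2, 7]          # prime factorization of a loop bound, e.g. 28
levels = ["Registers", "GlobalBuffer", "DRAM"]
log_capacity = {"Registers": math.log(16), "GlobalBuffer": math.log(2048)}
# assumed per-level cost of holding a factor (higher = more data movement)
cost = {"Registers": 1.0, "GlobalBuffer": 4.0, "DRAM": 16.0}

prob = pulp.LpProblem("toy_tiling", pulp.LpMinimize)

# x[i][l] == 1 iff prime factor i is tiled at memory level l
x = pulp.LpVariable.dicts("x", (range(len(factors)), levels), cat="Binary")

# each prime factor is assigned to exactly one memory level
for i in range(len(factors)):
    prob += pulp.lpSum(x[i][l] for l in levels) == 1

# capacity: the product of factors kept at or below a level must fit;
# in log space the product becomes a linear sum of log(factor) terms
for li, l in enumerate(levels[:-1]):
    inner = levels[: li + 1]
    prob += (
        pulp.lpSum(math.log(factors[i]) * x[i][m]
                   for i in range(len(factors)) for m in inner)
        <= log_capacity[l]
    )

# objective: a linear proxy for data-movement cost
prob += pulp.lpSum(cost[l] * math.log(factors[i]) * x[i][l]
                   for i in range(len(factors)) for l in levels)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for i, f in enumerate(factors):
    chosen = next(l for l in levels if pulp.value(x[i][l]) > 0.5)
    print(f"factor {f} -> {chosen}")
```

Running this prints which level each factor lands at (here the solver packs the 7 and one 2 into registers and spills the remaining 2 to the global buffer). The full formulation described in the abstract additionally covers spatial-versus-temporal mapping and loop permutation, with an objective capturing utilization and traffic rather than a single cost weight per level.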
