Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign

Recent programmable accelerators are faster and more energy efficient than general purpose processors, but expose complex hardware/software abstractions for compilers. A key problem is instruction scheduling, which requires sophisticated algorithms for mapping instructions to distributed processing elements, routing of operand dependences, and timing the arrival of operands to enable high throughput. The complex dependences between mapping, communication and timing make prior scheduling techniques insufficient. Optimization-based approaches are too slow, and heuristic-based approaches cannot achieve high quality. Our first insight is that the algorithm can be solved in a series of phases with overlapping responsibilities to reduce complexity. Second, it is possible to combine optimization-based and stochastic-heuristic based search strategies, to exploit the best features of both. This leads to the two primary techniques we explore, phase overlapping and hybridization. In this work we explore the codesign of scheduling algorithms with a challenging-to-schedule programmable accelerator. We show we can improve its area by 35% by trimming its scheduling-friendly structures, using a scheduling algorithm that is 5× faster than the state-of-the-art optimization-based scheduler, with up to 2× better throughput.

[1]  Peng Zhang,et al.  Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[2]  Hyoukjun Kwon,et al.  MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects , 2018, ASPLOS.

[3]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[4]  Jason Cong,et al.  A scalable communication-aware compilation flow for programmable accelerators , 2016, 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC).

[5]  Antonia Zhai,et al.  Triggered instructions: a control paradigm for spatially-programmed architectures , 2013, ISCA.

[6]  Rastislav Bodík,et al.  Chlorophyll : Synthesis-Aided Compiler for Low-Power Spatial Architectures by Phitchaya Mangpo Phothilimthana , 2015 .

[7]  Paul Feautrier,et al.  Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.

[8]  Scott A. Mahlke,et al.  Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[9]  Hadi Esmaeilzadeh,et al.  Scale-Out Acceleration for Machine Learning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10]  Kurt Keutzer,et al.  A Decomposition-based Constraint Optimization Approach for Statically Scheduling Task Graphs with Communication Delays to Multiprocessors , 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.

[11]  Kathryn S. McKinley,et al.  Static placement, dynamic issue (SPDI) scheduling for EDGE architectures , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[12]  Jason Cong,et al.  Composable accelerator-rich microprocessor enhanced for adaptivity and longevity , 2013, International Symposium on Low Power Electronics and Design (ISLPED).

[13]  Scott A. Mahlke,et al.  Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[14]  Jason Cong,et al.  CHARM: a composable heterogeneous accelerator-rich microprocessor , 2012, ISLPED '12.

[15]  Steven Swanson,et al.  Instruction scheduling for a tiled dataflow architecture , 2006, ASPLOS XII.

[16]  Karthikeyan Sankaralingam,et al.  Stream-dataflow acceleration , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[17]  Simha Sethumadhavan,et al.  Distributed Microarchitectural Protocols in the TRIPS Prototype Processor , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[18]  John R. Ellis,et al.  Bulldog: A Compiler for VLIW Architectures , 1986 .

[19]  Xia Chen,et al.  A spatial path scheduling algorithm for EDGE architectures , 2006, ASPLOS XII.

[20]  Scott A. Mahlke,et al.  CGRA express: accelerating execution using dynamic operation fusion , 2009, CASES '09.

[21]  Rudy Lauwereins,et al.  Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling , 2003, DATE.

[22]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[23]  Monica Sin-Ling Lam,et al.  A Systolic Array Optimizing Compiler , 1989 .

[24]  Karthikeyan Sankaralingam,et al.  A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories , 2014, ACM Trans. Program. Lang. Syst..

[25]  Kemal Ebcioglu,et al.  CARS: a new code generation framework for clustered ILP processors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[26]  Jens Palsberg,et al.  ILP-Based Resource-Aware Compilation , 2005 .

[27]  Jason Cong,et al.  Synthesis Algorithm for Application-Specific Homogeneous Processor Networks , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[28]  Jason Cong,et al.  A Fully Pipelined and Dynamically Composable Architecture of CGRA , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.

[29]  H. J. Caulfield,et al.  Optical implementation of systolic array processing , 1981 .

[30]  John Wawrzynek,et al.  Chisel: Constructing hardware in a Scala embedded language , 2012, DAC Design Automation Conference 2012.

[31]  Yoav Etsion,et al.  Single-graph multiple flows: Energy efficient design alternative for GPGPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[32]  Bozena Kaminska,et al.  Functional synthesis of digital systems with TASS , 1994, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[33]  Karthikeyan Sankaralingam,et al.  Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[34]  Joel S. Emer,et al.  Exploiting spatial architectures for edit distance algorithms , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[35]  C. Nicol A Coarse Grain Reconfigurable Array ( CGRA ) for Statically Scheduled Data Flow Computing , 2017 .

[36]  Karthikeyan Sankaralingam,et al.  A general constraint-centric scheduling framework for spatial architectures , 2013, PLDI.

[37]  Rudy Lauwereins,et al.  ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix , 2003, FPL.

[38]  Vicki H. Allan,et al.  Software pipelining , 1995, CSUR.

[39]  Karthikeyan Sankaralingam,et al.  DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.

[40]  Joel Emer,et al.  Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks , 2016, CARN.

[41]  Keith H. Randall,et al.  Denali: a goal-directed superoptimizer , 2002, PLDI '02.

[42]  Amin Ansari,et al.  Bundled execution of recurring traces for energy-efficient general purpose processing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[43]  Hadi Esmaeilzadeh,et al.  TABLA: A unified template-based framework for accelerating statistical machine learning , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[44]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[45]  Vivek Sarkar,et al.  Space-time scheduling of instruction-level parallelism on a raw machine , 1998, ASPLOS VIII.

[46]  Will Moore,et al.  Systolic arrays : papers presented at the first International Workshop on Systolic Arrays, Oxford, 2-4 July 1986 , 1987 .

[47]  Sharad Malik,et al.  The design of dynamically reconfigurable datapath coprocessors , 2004, TECS.

[48]  Karthikeyan Sankaralingam,et al.  Pushing the limits of accelerator efficiency while retaining programmability , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[49]  Paul Feautrier,et al.  Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time , 1992, International Journal of Parallel Programming.

[50]  Kunle Olukotun,et al.  Plasticine: A reconfigurable architecture for parallel patterns , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[51]  Thomas M. Conte,et al.  Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[52]  Scott A. Mahlke,et al.  Edge-centric modulo scheduling for coarse-grained reconfigurable architectures , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[53]  Gu-Yeon Wei,et al.  MachSuite: Benchmarks for accelerator design and customized architectures , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).