Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign
Jian Weng | Karthikeyan Sankaralingam | Newsha Ardalani | Tony Nowatzki
[1] Peng Zhang,et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[2] Hyoukjun Kwon,et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects , 2018, ASPLOS.
[3] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[4] Jason Cong,et al. A scalable communication-aware compilation flow for programmable accelerators , 2016, 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC).
[5] Antonia Zhai,et al. Triggered instructions: a control paradigm for spatially-programmed architectures , 2013, ISCA.
[6] Rastislav Bodík,et al. Chlorophyll: Synthesis-Aided Compiler for Low-Power Spatial Architectures by Phitchaya Mangpo Phothilimthana , 2015 .
[7] Paul Feautrier,et al. Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.
[8] Scott A. Mahlke,et al. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[9] Hadi Esmaeilzadeh,et al. Scale-Out Acceleration for Machine Learning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[10] Kurt Keutzer,et al. A Decomposition-based Constraint Optimization Approach for Statically Scheduling Task Graphs with Communication Delays to Multiprocessors , 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.
[11] Kathryn S. McKinley,et al. Static placement, dynamic issue (SPDI) scheduling for EDGE architectures , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.
[12] Jason Cong,et al. Composable accelerator-rich microprocessor enhanced for adaptivity and longevity , 2013, International Symposium on Low Power Electronics and Design (ISLPED).
[13] Scott A. Mahlke,et al. Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[14] Jason Cong,et al. CHARM: a composable heterogeneous accelerator-rich microprocessor , 2012, ISLPED '12.
[15] Steven Swanson,et al. Instruction scheduling for a tiled dataflow architecture , 2006, ASPLOS XII.
[16] Karthikeyan Sankaralingam,et al. Stream-dataflow acceleration , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[17] Simha Sethumadhavan,et al. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[18] John R. Ellis,et al. Bulldog: A Compiler for VLIW Architectures , 1986 .
[19] Xia Chen,et al. A spatial path scheduling algorithm for EDGE architectures , 2006, ASPLOS XII.
[20] Scott A. Mahlke,et al. CGRA express: accelerating execution using dynamic operation fusion , 2009, CASES '09.
[21] Rudy Lauwereins,et al. Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling , 2003, DATE.
[22] Henry Hoffmann,et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.
[23] Monica Sin-Ling Lam,et al. A Systolic Array Optimizing Compiler , 1989 .
[24] Karthikeyan Sankaralingam,et al. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories , 2014, ACM Trans. Program. Lang. Syst.
[25] Kemal Ebcioglu,et al. CARS: a new code generation framework for clustered ILP processors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.
[26] Jens Palsberg,et al. ILP-Based Resource-Aware Compilation , 2005 .
[27] Jason Cong,et al. Synthesis Algorithm for Application-Specific Homogeneous Processor Networks , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[28] Jason Cong,et al. A Fully Pipelined and Dynamically Composable Architecture of CGRA , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.
[29] H. J. Caulfield,et al. Optical implementation of systolic array processing , 1981 .
[30] John Wawrzynek,et al. Chisel: Constructing hardware in a Scala embedded language , 2012, DAC Design Automation Conference 2012.
[31] Yoav Etsion,et al. Single-graph multiple flows: Energy efficient design alternative for GPGPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[32] Bozena Kaminska,et al. Functional synthesis of digital systems with TASS , 1994, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
[33] Karthikeyan Sankaralingam,et al. Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[34] Joel S. Emer,et al. Exploiting spatial architectures for edit distance algorithms , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[35] C. Nicol. A Coarse Grain Reconfigurable Array (CGRA) for Statically Scheduled Data Flow Computing , 2017 .
[36] Karthikeyan Sankaralingam,et al. A general constraint-centric scheduling framework for spatial architectures , 2013, PLDI.
[37] Rudy Lauwereins,et al. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix , 2003, FPL.
[38] Vicki H. Allan,et al. Software pipelining , 1995, CSUR.
[39] Karthikeyan Sankaralingam,et al. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.
[40] Joel Emer,et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks , 2016, CARN.
[41] Keith H. Randall,et al. Denali: a goal-directed superoptimizer , 2002, PLDI '02.
[42] Amin Ansari,et al. Bundled execution of recurring traces for energy-efficient general purpose processing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[43] Hadi Esmaeilzadeh,et al. TABLA: A unified template-based framework for accelerating statistical machine learning , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[44] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[45] Vivek Sarkar,et al. Space-time scheduling of instruction-level parallelism on a raw machine , 1998, ASPLOS VIII.
[46] Will Moore,et al. Systolic arrays: papers presented at the first International Workshop on Systolic Arrays, Oxford, 2-4 July 1986 , 1987 .
[47] Sharad Malik,et al. The design of dynamically reconfigurable datapath coprocessors , 2004, TECS.
[48] Karthikeyan Sankaralingam,et al. Pushing the limits of accelerator efficiency while retaining programmability , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[49] Paul Feautrier,et al. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time , 1992, International Journal of Parallel Programming.
[50] Kunle Olukotun,et al. Plasticine: A reconfigurable architecture for parallel patterns , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[51] Thomas M. Conte,et al. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.
[52] Scott A. Mahlke,et al. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[53] Gu-Yeon Wei,et al. MachSuite: Benchmarks for accelerator design and customized architectures , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).