Paper information - A scalable communication-aware compilation flow for programmable accelerators

A scalable communication-aware compilation flow for programmable accelerators

Programmable accelerators (PAs) are receiving increased attention in domain-specific architecture designs because they provide more general support for customization. In a PA-rich system, computational kernels are compiled into predefined PA templates and dynamically mapped to real PAs at runtime. This poses a demanding challenge for the compiler: how to generate high-quality PA mapping code. Another important concern is the communication cost among PAs: if not handled properly at compile time, data transfers among tens or hundreds of accelerators in a PA-rich system will limit the overall performance gain. In this paper, we present an efficient PA compilation flow that scales to mapping large computation kernels onto PA-rich architectures. Communication overhead is modeled and optimized in the proposed flow to reduce runtime data transfers among accelerators. Experimental results show that, for 12 computation-intensive standard benchmarks, the proposed approach significantly improves compilation scalability, mapping quality, and overall communication cost compared to state-of-the-art PA compilation approaches. We also evaluate the proposed flow on a recently developed PA-rich platform [1]; the final performance gain is improved by 49.5% on average.

[1] Paolo Bonzini, et al. Polynomial-time subgraph enumeration for automated instruction set extension, 2007.

[2] Karthikeyan Sankaralingam, et al. Dynamically Specialized Datapaths for energy efficient computing, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[3] Jason Cong, et al. Architecture support for accelerator-rich CMPs, 2012, DAC Design Automation Conference 2012.

[4] Milo M. K. Martin, et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, 2005, CARN.

[5] Michael D. Smith, et al. A high-performance microarchitecture with hardware-programmable functional units, 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.

[6] H. Franke, et al. Introduction to the wire-speed processor and architecture, 2010, IBM J. Res. Dev.

[7] Michael C. Huang, et al. Efficient data streaming with on-chip accelerators: Opportunities and challenges, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[8] Scott A. Mahlke, et al. An architecture framework for transparent instruction set customization in embedded processors, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[9] Edward T. Grochowski, et al. Larrabee: A many-Core x86 architecture for visual computing, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[10] G. Shipman, et al. Omega Library, 2011, Encyclopedia of Parallel Computing.

[11] Jason Cong, et al. CHARM: a composable heterogeneous accelerator-rich microprocessor, 2012, ISLPED '12.

[12] Scott A. Mahlke, et al. Scalable subgraph mapping for acyclic computation accelerators, 2006, CASES '06.

[13] Scott A. Mahlke, et al. Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping, 2007, International Symposium on Code Generation and Optimization (CGO'07).

[14] Jason Cong, et al. Customizable Domain-Specific Computing, 2009, IEEE Design & Test of Computers.

[15] Jason Cong, et al. Pattern-based behavior synthesis for FPGA resource reduction, 2008, FPGA '08.

[16] Tulika Mitra, et al. Disjoint Pattern Enumeration for Custom Instructions Identification, 2007, 2007 International Conference on Field Programmable Logic and Applications.

[17] Fredrik Larsson, et al. Simics: A Full System Simulation Platform, 2002, Computer.

[18] Paolo Bonzini, et al. Polynomial-Time Subgraph Enumeration for Automated Instruction Set Extension, 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.

[19] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).