A scalable communication-aware compilation flow for programmable accelerators

Programmable accelerators (PAs) are receiving increased attention in domain-specific architecture designs to provide more general support for customization. In a PA-rich system, computational kernels are compiled into predefined PA templates and dynamically mapped to real PAs at runtime. This poses a demanding challenge for the compiler: how to generate high-quality PA mapping code. Another important concern is the communication cost among PAs: if not handled properly at compile time, data transfers among tens or hundreds of accelerators in a PA-rich system will limit the overall performance gain. In this paper, we present an efficient PA compilation flow that scales to mapping large computation kernels onto PA-rich architectures. The proposed flow models and optimizes communication overhead to reduce runtime data transfers among accelerators. Experimental results on 12 computation-intensive standard benchmarks show that the proposed approach significantly improves compilation scalability, mapping quality, and overall communication cost compared to state-of-the-art PA compilation approaches. We also evaluate the proposed flow on a recently developed PA-rich platform [1]; the final performance gain improves by 49.5% on average.

[1] Paolo Bonzini, et al. Polynomial-time subgraph enumeration for automated instruction set extension, 2007.

[2] Karthikeyan Sankaralingam, et al. Dynamically Specialized Datapaths for energy efficient computing, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[3] Jason Cong, et al. Architecture support for accelerator-rich CMPs, 2012, DAC Design Automation Conference 2012.

[4] Milo M. K. Martin, et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, 2005, CARN.

[5] Michael D. Smith, et al. A high-performance microarchitecture with hardware-programmable functional units, 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.

[6] H. Franke, et al. Introduction to the wire-speed processor and architecture, 2010, IBM J. Res. Dev.

[7] Michael C. Huang, et al. Efficient data streaming with on-chip accelerators: Opportunities and challenges, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[8] Scott A. Mahlke, et al. An architecture framework for transparent instruction set customization in embedded processors, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[9] Edward T. Grochowski, et al. Larrabee: A many-Core x86 architecture for visual computing, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[10] G. Shipman, et al. Omega Library, 2011, Encyclopedia of Parallel Computing.

[11] Jason Cong, et al. CHARM: a composable heterogeneous accelerator-rich microprocessor, 2012, ISLPED '12.

[12] Scott A. Mahlke, et al. Scalable subgraph mapping for acyclic computation accelerators, 2006, CASES '06.

[13] Scott A. Mahlke, et al. Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping, 2007, International Symposium on Code Generation and Optimization (CGO'07).

[14] Jason Cong, et al. Customizable Domain-Specific Computing, 2009, IEEE Design & Test of Computers.

[15] Jason Cong, et al. Pattern-based behavior synthesis for FPGA resource reduction, 2008, FPGA '08.

[16] Tulika Mitra, et al. Disjoint Pattern Enumeration for Custom Instructions Identification, 2007, 2007 International Conference on Field Programmable Logic and Applications.

[17] Fredrik Larsson, et al. Simics: A Full System Simulation Platform, 2002, Computer.

[18] Paolo Bonzini, et al. Polynomial-Time Subgraph Enumeration for Automated Instruction Set Extension, 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.

[19] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).