Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping

The demand for high performance has driven acyclic computation accelerators into extensive use in modern embedded and desktop architectures. Accelerators that are ideal from a software perspective, are difficult or impossible to integrate in many modern architectures, though, due to area and timing requirements. This reality is coupled with the observation that many application domains under-utilize accelerator hardware, because of the narrow data they operate on and the nature of their computation. In this work, we take advantage of these facts to design accelerators capable of executing in modern architectures by narrowing datapath width and reducing interconnect. Novel compiler techniques are developed in order to generate high-quality code for the reduced-cost accelerators and prevent performance loss to the extent possible. First, data width profiling is used to statistically determine how wide program data will be at run time. This information is used by the subgraph mapping algorithm to optimally select subgraphs for execution on targeted narrow accelerators. Overall, our data-centric compilation techniques achieve on average 6.5%, and up to 12%, speed up over previous subgraph mapping algorithms for 8-bit accelerators. We also show that, with appropriate compiler support, the increase in the total number of execution cycles in reduced-interconnect accelerators is less than 1% of the fully-connected accelerator

[1]  Nathan Clark,et al.  An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors , 2005, ISCA 2005.

[2]  Kanad Ghose,et al.  Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[3]  Alfred V. Aho,et al.  Code generation using tree matching and dynamic programming , 1989, ACM Trans. Program. Lang. Syst..

[4]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[5]  Gilles Pokam,et al.  Speculative software management of datapath-width for energy optimization , 2004, LCTES '04.

[6]  Scott A. Mahlke,et al.  Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[7]  R. Canal,et al.  Very low power pipelines using significance compression , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[8]  Rainer Leupers,et al.  Instruction selection for embedded DSPs with complex instructions , 1996, Proceedings EURO-DAC '96. European Design Automation Conference with EURO-VHDL '96 and Exhibition.

[9]  James E. Smith,et al.  Software-controlled operand-gating , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[10]  Gabriel H. Loh,et al.  Static strands: safely collapsing dependence chains for increasing embedded power efficiency , 2005, LCTES.

[11]  Giovanni De Micheli,et al.  Automatic instruction set extension and utilization for embedded processors , 2003, Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003.

[12]  Kurt Keutzer,et al.  Instruction selection using binate covering for code size optimization , 1995, Proceedings of IEEE International Conference on Computer Aided Design (ICCAD).

[13]  Olivier Temam,et al.  From sequences of dependent instructions to functions: an approach for improving performance without ILP or speculation , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[14]  Julian R. Ullmann,et al.  An Algorithm for Subgraph Isomorphism , 1976, J. ACM.

[15]  Kim Henrick,et al.  Common subgraph isomorphism detection by backtracking search , 2004, Softw. Pract. Exp..

[16]  Nikil D. Dutt,et al.  ISEGEN: generation of high-quality instruction set extensions by iterative improvement , 2005, Design, Automation and Test in Europe.

[17]  Gabriel H. Loh Exploiting data-width locality to increase superscalar execution bandwidth , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[18]  Scott A. Mahlke,et al.  An architecture framework for transparent instruction set customization in embedded processors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[19]  Paolo Ienne,et al.  Automatic application-specific instruction-set extensions under microarchitectural constraints , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[20]  James E. Smith,et al.  Very low power pipelines using significance compression , 2000, MICRO 33.

[21]  Mikko H. Lipasti,et al.  An approach for implementing efficient superscalar CISC processors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[22]  Amir Roth,et al.  Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[23]  James E. Smith,et al.  Using dynamic binary translation to fuse dependent instructions , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[24]  Scott A. Mahlke,et al.  Scalable subgraph mapping for acyclic computation accelerators , 2006, CASES '06.

[25]  Stamatis Vassiliadis,et al.  High-Performance 3-1 Interlock Collapsing ALU's , 1994, IEEE Trans. Computers.

[26]  Kerstin Eder,et al.  International Symposium on Code Generation and Optimization. CGO 2003 , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..