Compiler-assisted Hybrid Operand Communication

Communication of operands among in-flight instructions can be power intensive, especially in superscalar processors where all result tags are broadcast to a small number of consumers through a multi-entry CAM. Token-based point-to-point communication of operands in dataflow architectures is highly efficient when each produced token has only one consumer, but inefficient when there are many consumers due to the construction of software fanout trees. Placing operands in registers is efficient for broadcasting the values which have consumers spread over a long lifetime, but inefficient for shorter-lived operations. This paper evaluates a compilerassisted hybrid instruction communication model that combine tokens instruction communication with statically assigned broadcast tags. Each fixed-size block of code is given a small number of architectural broadcast identifiers, which the compiler can assign to producers that have many consumers. Producers with few consumers rely on point-to-point communication through tokens. Producers whose result is live past the instruction block communicate with distant consumers through a register. Selecting the mechanism statically by the compiler relieves the hardware from categorizing instructions at runtime. At the same time, a compiler can categorize instructions better than dynamic selection does because the compiler analyzes a larger range of instructions. Furthermore, compiler could perform complex optimizations without hardware cost and execution-time penalty. We propose a compiler optimization to reuse broadcast tags for instructions with non-overlapping broadcast live ranges, the speedup is further improved without spending more power . The results show that this compiler-assisted hybrid token/broadcast model requires only eight architectural broadcasts per block, enabling highly efficient CAMs. This hybrid model reduces instruction communication energy by 28% compared to a strictly token-based dataflow model (and by over 2.7X compared to a hybrid model without compiler support), while simultaneously increasing performance by 8% on average across the SPECINT and EEMBC benchmarks, running as single threads on 16 composed, dual-issue EDGE cores.

[1]  S. Winkel Optimal versus Heuristic Global Code Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[2]  Brad Calder,et al.  Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[3]  Gurindar S. Sohi,et al.  Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors , 1992, MICRO.

[4]  Scott A. Mahlke,et al.  Effective compiler support for predicated execution using the hyperblock , 1992, MICRO 25.

[5]  Ramon Canal,et al.  A low-complexity issue logic , 2000, ICS '00.

[6]  Kathryn S. McKinley,et al.  Strategies for mapping dataflow blocks to distributed hardware , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[7]  Jack B. Dennis,et al.  A preliminary architecture for a basic data-flow processor , 1974, ISCA '98.

[8]  Mateo Valero,et al.  A new pointer-based instruction queue design and its power-performance evaluation , 2005, 2005 International Conference on Computer Design.

[9]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003, ISCA '03.

[10]  Gürhan Küçük,et al.  Energy-efficient issue queue design , 2003, IEEE Trans. Very Large Scale Integr. Syst..

[11]  Michael C. Huang,et al.  Energy-efficient hybrid wakeup logic , 2002, ISLPED '02.

[12]  M.A. Ramirez,et al.  Direct Instruction Wakeup for Out-of-Order Processors , 2004, Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04).

[13]  Lizy Kurian John,et al.  Scaling to the end of silicon with EDGE architectures , 2004, Computer.

[14]  Ramon Canal,et al.  Reducing the complexity of the issue logic , 2001, ICS '01.

[15]  Arvind,et al.  Executing a Program on the MIT Tagged-Token Dataflow Architecture , 1990, IEEE Trans. Computers.

[16]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[17]  Aaron Smith,et al.  Compiling for EDGE architectures , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[18]  Jeffrey R. Diamond,et al.  An evaluation of the TRIPS computer system , 2009, ASPLOS.

[19]  Jack B. Dennis,et al.  A preliminary architecture for a basic data-flow processor , 1974, ISCA '75.