The Good Block: Hardware/Software Design for Composable, Block-Atomic Processors

Power consumption, design complexity, and on-chip latency are forcing computer systems to exploit more parallelism efficiently. Explicit Data Graph Execution (EDGE) architectures expose parallelism by dividing programs into blocks of dataflow operations, exploiting both inter- and intra-block concurrency. This paper studies the balance of complexity and capability between EDGE architectures and their compilers. We address three main questions. (1) What block granularities achieve high performance efficiently? (2) What are good block instruction selection policies? (3) What architecture and compiler support do these designs require? Our results show that the compiler requires multiple block sizes to adapt applications to block-atomic hardware and achieve high performance. Although an architecture supporting a single block size is simpler, the additions required for variable sizes are modest and ease hardware configuration. We propose hand-crafted and learned compiler policies for block formation. The best policies provide performance advantages of up to a factor of 3 in some configurations. The best policy varies with (1) the amount of parallelism inherent in the application, e.g., integer versus numerical codes, and (2) the available parallel resources. The resulting configurable architecture and compiler efficiently expose and exploit software and hardware parallelism.
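To make the notion of a block-formation policy under a size budget concrete, here is a minimal toy sketch (not the paper's actual algorithm): a greedy policy that packs consecutive basic blocks into one block-atomic unit until an instruction budget is exhausted. The function name and the 128-instruction budget are illustrative assumptions, not details from the paper.

```python
def form_blocks(basic_blocks, max_insts=128):
    """Greedily pack basic blocks into atomic blocks of at most max_insts
    instructions each (toy illustration of a block-formation policy)."""
    atomic_blocks, current, size = [], [], 0
    for bb in basic_blocks:
        n = len(bb)
        # Flush the current atomic block if adding bb would exceed the budget.
        if current and size + n > max_insts:
            atomic_blocks.append(current)
            current, size = [], 0
        current.append(bb)
        size += n
    if current:
        atomic_blocks.append(current)
    return atomic_blocks

# Example: basic blocks represented as lists of instructions (placeholders here).
bbs = [[0] * 50, [0] * 60, [0] * 40, [0] * 10]
sizes = [sum(len(bb) for bb in blk) for blk in form_blocks(bbs)]
print(sizes)  # → [110, 50]
```

A real compiler policy would additionally weigh predication cost, branch probabilities, and dataflow criticality when deciding which blocks to merge; this sketch captures only the size constraint.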
