Compiling for EDGE architectures

Explicit data graph execution (EDGE) architectures offer the possibility of high instruction-level parallelism with energy efficiency. In EDGE architectures, the compiler breaks a program into a sequence of structured blocks that the hardware executes atomically. The instructions within each block communicate directly rather than through shared registers. The TRIPS EDGE architecture imposes restrictions on its blocks to simplify the microarchitecture: each TRIPS block has at most 128 instructions, issues at most 32 loads and/or stores, and performs at most 32 register bank reads and 32 register bank writes. To detect block completion, each TRIPS block must produce a constant number of outputs (stores and register writes) and a branch decision. The goal of the TRIPS compiler is to produce TRIPS blocks full of useful instructions while enforcing these constraints. This paper describes a set of compiler algorithms that meet these sometimes conflicting goals, including an algorithm that assigns load and store identifiers to maximize the number of loads and stores within a block. We demonstrate the correctness of these algorithms in simulation on SPEC2000, EEMBC, and microbenchmarks extracted from SPEC2000 and others, and we measure speedup in cycles over an Alpha 21264 on the microbenchmarks.
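
To make the per-block limits concrete, the following Python sketch checks whether a candidate block fits resource limits matching those described above (128 instructions, 32 loads/stores, 32 register bank reads, 32 register bank writes). It is an illustration only, not the TRIPS compiler's actual data structures or algorithm; the Instr class and field names are invented for the example.

    # Illustrative sketch only: checks whether a candidate block fits TRIPS-style
    # per-block limits described in the abstract. All names here are invented for
    # the example; they are not taken from the TRIPS toolchain.

    from dataclasses import dataclass
    from typing import List, Tuple

    MAX_INSTRUCTIONS = 128  # at most 128 instructions per block
    MAX_LOAD_STORES  = 32   # at most 32 loads and/or stores per block
    MAX_REG_READS    = 32   # at most 32 register bank reads per block
    MAX_REG_WRITES   = 32   # at most 32 register bank writes per block

    @dataclass
    class Instr:
        opcode: str
        reg_reads: Tuple[str, ...] = ()   # registers read from the register file (block inputs)
        reg_writes: Tuple[str, ...] = ()  # registers written to the register file (block outputs)

    def block_fits_limits(block: List[Instr]) -> bool:
        """Return True if the candidate block respects all per-block resource limits."""
        loads_stores = sum(1 for i in block if i.opcode in ("load", "store"))
        reg_reads    = sum(len(i.reg_reads)  for i in block)
        reg_writes   = sum(len(i.reg_writes) for i in block)
        return (len(block)   <= MAX_INSTRUCTIONS and
                loads_stores <= MAX_LOAD_STORES and
                reg_reads    <= MAX_REG_READS and
                reg_writes   <= MAX_REG_WRITES)

    # Example: a tiny block that trivially fits the limits.
    example = [Instr("load", reg_reads=("r1",)),
               Instr("add",  reg_writes=("r2",))]
    assert block_fits_limits(example)

A real block formation pass would use a check like this to decide whether two candidate blocks can be merged or whether a block must be split; the actual TRIPS constraints also involve per-bank register limits and predication, which this sketch omits.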
