Optimal versus Heuristic Global Code Scheduling

We present a global instruction scheduler based on integer linear programming (ILP) that was implemented experimentally in the Intel Itanium® product compiler. It supports virtually the full range of known EPIC scheduling optimizations, more than GCS, its heuristic counterpart in the compiler, and unlike GCS it computes optimal solutions, that is, schedules of minimal length. Thanks to a highly efficient ILP model, it can solve problem instances with 500-750 instructions, and in combination with region scheduling it can schedule routines of arbitrary size. In experiments on five SPEC® CPU2006 integer benchmarks, ILP-scheduled code shows a 32% schedule length advantage and a 10% runtime speedup over GCS-scheduled code at the highest compiler optimization levels, which are those typically used for SPEC submissions. We further study the impact of different code motion classes, region sizes, and target microarchitectures, gaining insights into the nature of the global instruction scheduling problem.
