Matrix scheduler reloaded

From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is especially evident in schedulers, which on out-of-order cores need to be both large and single-cycle for maximum performance. In this work we present two straightforward modifications to a matrix scheduler implementation that greatly improve its scalability. Both are based on the simple observation that the wakeup and picker matrices are sparse, even at small sizes; thus small indirection tables can be used to greatly reduce their width and latency. This technique can be used to create faster iso-performance schedulers (critical path reduced by 17-58%) or larger iso-timing schedulers (IPC increased by 7-26%). Importantly, the power and area requirements of the additional hardware are likely offset by the greatly reduced matrix sizes and by subsuming the functionality of the power-hungry allocation CAMs.
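The sparsity observation can be illustrated with a small software model. The sketch below is not the paper's design; the abstract does not describe how the indirection tables are organized, so the class names, the `width` parameter, and the column-allocation policy are all assumptions made for illustration. It contrasts a dense wakeup matrix, with one column per scheduler entry, against a narrower matrix whose columns are assigned through an indirection table only to producers that currently have waiting consumers.

```python
class DenseWakeupMatrix:
    """One row per scheduler entry, one column per possible producer entry."""

    def __init__(self, n):
        self.rows = [0] * n  # bit j of rows[i] set => entry i waits on entry j

    def insert(self, entry, producers):
        self.rows[entry] = sum(1 << p for p in set(producers))

    def wakeup(self, producer):
        mask = ~(1 << producer)  # clear the producer's column in every row
        self.rows = [r & mask for r in self.rows]

    def ready(self, entry):
        return self.rows[entry] == 0


class IndirectWakeupMatrix:
    """Hypothetical narrow matrix: columns are allocated on demand through an
    indirection table (producer entry -> column). Illustrative assumption,
    not the organization used in the paper."""

    def __init__(self, n, width):
        self.rows = [0] * n
        self.col_of = {}                      # the indirection table
        self.free_cols = list(range(width))   # unassigned narrow columns

    def insert(self, entry, producers):
        producers = set(producers)
        new = [p for p in producers if p not in self.col_of]
        if len(new) > len(self.free_cols):
            # In hardware this would presumably stall allocation.
            raise RuntimeError("no free column for a new producer")
        for p in new:
            self.col_of[p] = self.free_cols.pop()
        self.rows[entry] = sum(1 << self.col_of[p] for p in producers)

    def wakeup(self, producer):
        col = self.col_of.pop(producer, None)
        if col is None:
            return  # no consumer was waiting on this producer
        mask = ~(1 << col)
        self.rows = [r & mask for r in self.rows]
        self.free_cols.append(col)  # column can be reused by a later producer

    def ready(self, entry):
        return self.rows[entry] == 0
```

In this model an 8-entry window with three outstanding producers needs only 3 columns instead of 8, and both matrices report the same ready entries after each wakeup. The trade-off, under these assumptions, is the extra table lookup and the possibility of running out of columns when many distinct producers are outstanding at once.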
