Achieving Out-of-Order Performance with Almost In-Order Complexity

Out-of-order processors with wider issue widths still have much performance to offer. However, traditional methods of increasing issue width do not scale: they drastically increase design complexity and power requirements. This paper introduces the braid, a compile-time identified entity that enables the execution core to scale to wider widths by exploiting the small fanout and short lifetime of values produced by the program. Braid processing requires identification by the compiler, minor extensions to the ISA, and support from the microarchitecture. Processing braids yields performance within 9% of a very aggressive conventional out-of-order microarchitecture, with almost the complexity of an in-order implementation.
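
To make "small fanout" and "short lifetime" concrete, the following is a minimal sketch that measures both properties for values produced in a straight-line block of instructions. It is an illustration only, not the paper's braid-identification algorithm: the `Instr` record, the `value_stats` function, the register names, and the sample block are all hypothetical, and the definitions used here (fanout as the number of consuming instructions, lifetime as the distance in instructions from producer to last consumer) are assumptions chosen for the example.

```python
from collections import namedtuple

# Hypothetical three-address instruction: one destination register
# (or None for instructions such as stores) and a tuple of source registers.
Instr = namedtuple("Instr", ["dest", "srcs"])


def value_stats(block):
    """Return {producer_index: (fanout, lifetime)} for values defined in block.

    fanout   = number of later instructions that read the value before
               its register is overwritten
    lifetime = distance in instructions from the producer to its last
               reader (0 if the value is never read inside the block)
    """
    stats = {}     # producer index -> [fanout, lifetime]
    live_def = {}  # register name -> index of the instruction that last wrote it
    for i, ins in enumerate(block):
        # Read sources first, so an instruction that reads and writes the
        # same register consumes the old value.
        for src in set(ins.srcs):  # count each consuming instruction once
            if src in live_def:    # values live-in to the block are ignored
                p = live_def[src]
                stats[p][0] += 1
                stats[p][1] = i - p
        if ins.dest is not None:
            live_def[ins.dest] = i
            stats[i] = [0, 0]
    return {p: tuple(v) for p, v in stats.items()}


# Sample block (made up for illustration); r10 and r11 are live-in registers.
block = [
    Instr("r1", ("r10",)),       # 0: r1 = f(r10)
    Instr("r2", ("r1", "r11")),  # 1: r2 = f(r1, r11)
    Instr("r3", ("r1", "r2")),   # 2: r3 = f(r1, r2)
    Instr("r1", ("r3",)),        # 3: r1 = f(r3), overwrites the first r1
    Instr(None, ("r1", "r10")),  # 4: store-like instruction, no destination
]

for producer, (fanout, lifetime) in sorted(value_stats(block).items()):
    print(f"value from instr {producer}: fanout={fanout}, lifetime={lifetime}")
```

In this toy block every value is consumed at most twice and dies within two instructions of being produced, which is the kind of behavior the abstract says braids exploit. Real programs would need analysis across basic blocks and memory dependences, which this sketch does not attempt.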
