Improving dynamic cluster assignment for clustered trace cache processors

This work examines dynamic cluster assignment for a clustered trace cache processor (CTCP). Previously proposed cluster assignment techniques run into unique problems as issue width and cluster count increase. Realistic design conditions, such as variable data forwarding latencies between clusters and a heavily partitioned instruction window, make effective cluster assignment more difficult.

In this work, the trace cache and fill unit are used to perform dynamic cluster assignment. The retire-time fill unit analysis is aided by a dynamic profiling mechanism embedded within the trace cache. This mechanism provides information about inter-trace data dependencies, an element absent from previous retire-time CTCP cluster assignment work. The proposed strategy leads to more intra-cluster data forwarding and shorter data forwarding distances. In addition, performing cluster assignment at retire time reduces issue-time complexity and eliminates early pipeline stages. Overall performance for integer programs increases by 11.5% over our base CTCP architecture, a speedup significantly higher than that of a previously proposed retire-time CTCP assignment strategy. Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as for media benchmarks.
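To make the idea of dependence-aware cluster assignment concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in this work. It assumes a toy model in which clusters are arranged in a line, forwarding distance is the absolute difference between cluster indices, and each cluster offers a fixed number of slots per trace. The function `assign_clusters`, its parameters, and the `inter_trace_hint` map (standing in for profiled inter-trace dependency information) are all hypothetical names introduced for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def assign_clusters(
    trace: Sequence[Tuple[str, Sequence[str]]],
    num_clusters: int,
    slots_per_cluster: int,
    inter_trace_hint: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Greedy toy assignment of a trace's instructions to clusters.

    trace: (instruction name, names of producer instructions), in program order.
    inter_trace_hint: hypothetical map from an instruction to the cluster where
        its producer from an earlier trace is assumed to execute.
    """
    hint = inter_trace_hint or {}
    free = [slots_per_cluster] * num_clusters
    placed: Dict[str, int] = {}

    for name, producers in trace:
        # Clusters this instruction would like to be near: in-trace producers
        # that are already placed, plus any profiled inter-trace producer.
        preferred = [placed[p] for p in producers if p in placed]
        if name in hint:
            preferred.append(hint[name])

        candidates = [c for c in range(num_clusters) if free[c] > 0]
        if not candidates:
            raise ValueError("trace does not fit in the available slots")

        # Minimize total forwarding distance (assumed linear topology), then
        # prefer the emptiest cluster, then the lowest index.
        best = min(
            candidates,
            key=lambda c: (sum(abs(c - p) for p in preferred), -free[c], c),
        )
        placed[name] = best
        free[best] -= 1

    return placed


example = [("i0", []), ("i1", ["i0"]), ("i2", []), ("i3", ["i2"]), ("i4", ["i1"])]
print(assign_clusters(example, num_clusters=2, slots_per_cluster=4))
# {'i0': 0, 'i1': 0, 'i2': 1, 'i3': 1, 'i4': 0}
```

In this toy example the two independent dependence chains land in different clusters, so every producer-consumer pair forwards within its own cluster. A real fill unit would also have to account for the trace's physical slot layout and the actual inter-cluster latencies, which this sketch ignores.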
