论文信息 - Tolerating Branch Predictor Latency on SMT

Tolerating Branch Predictor Latency on SMT

Simultaneous Multithreading (SMT) tolerates latency by executing instructions from multiple threads. If a thread is stalled, resources can be used by other threads. However, fetch stall conditions caused by multi-cycle branch predictors prevent SMT to achieve all its potential performance, since the flow of fetched instructions is halted.

Mateo Valero | Ayose Falcón | Alex Ramírez | Oliverio J. Santana

[1] Norman P. Jouppi,et al. Cacti 3. 0: an integrated cache timing, power, and area model , 2001 .

[2] Dean M. Tullsen,et al. Fellowship - Simulation And Modeling Of A Simultaneous Multithreading Processor , 1996, Int. CMG Conference.

[3] Dean M. Tullsen,et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[4] Daniel A. Jiménez,et al. Reconsidering complex branch predictors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[5] Mario Nemirovsky,et al. Increasing superscalar performance through multistreaming , 1995, PACT.

[6] Glenn Reinman,et al. Optimizations Enabled by a Decoupled Front-End Architecture , 2001, IEEE Trans. Computers.

[7] Daniel A. Jiménez,et al. The impact of delay on the design of branch predictors , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[8] Pierre Michaud,et al. Trading Conflict And Capacity Aliasing In Conditional Branch Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[9] Dean M. Tullsen,et al. Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[10] James E. Smith,et al. Path-based next trace prediction , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11] Brad Calder,et al. Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[12] André Seznec,et al. Effective ahead pipelining of instruction block address generation , 2003, ISCA '03.

[13] Mateo Valero,et al. Fetching instruction streams , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[14] Vikas Agarwal,et al. Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15] Ken Mai,et al. The future of wires , 2001, Proc. IEEE.