An evaluation of speculative instruction execution on simultaneous multithreaded processors

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources on such machines would be largely idle. In contrast to superscalars, simultaneous multithreaded (SMT) processors achieve high resource utilization by issuing instructions from multiple threads every cycle. An SMT processor thus has two means of hiding latency: speculation and multithreaded execution. However, these two techniques may conflict; on an SMT processor, wrong-path speculative instructions from one thread may compete with and displace useful instructions from another thread. For this reason, it is important to understand the trade-offs between these two latency-hiding techniques, and to ask whether multithreaded processors should speculate differently than conventional superscalars.

This paper evaluates the behavior of instruction speculation on SMT processors using both multiprogrammed (SPECINT and SPECFP) and multithreaded (the Apache Web server) workloads. We measure and analyze the impact of speculation and demonstrate how speculation on an 8-context SMT differs from superscalar speculation. We also examine the effect of speculation-aware fetch and branch prediction policies in the processor. Our results quantify the extent to which (1) speculation is critical to performance on a multithreaded processor, because it ensures an ample supply of parallelism to feed the functional units, and (2) SMT actually enhances the effectiveness of speculative execution, compared to a superscalar processor, by reducing the impact of branch misprediction. Finally, we quantify the impact of both hardware configuration and workload characteristics on speculation's usefulness and demonstrate that, in nearly all cases, speculation is beneficial to SMT performance.
