Variability in architectural simulations of multi-threaded workloads

Multi-threaded commercial workloads implement many important Internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a workload's performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths. Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.

[1]  Josep Torrellas,et al.  A direct-execution framework for fast and accurate simulation of superscalar processors , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[2]  K. Driesen,et al.  Accurate indirect branch prediction , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[3]  Alaa R. Alameldeen,et al.  Timestamp snooping: an approach for extending SMPs , 2000, SIGP.

[4]  H. Frank,et al.  Statistics: concepts and applications , 1996 .

[5]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[6]  Alan E. Charlesworth,et al.  Starfire: extending the SMP envelope , 1998, IEEE Micro.

[7]  Janak H. Patel,et al.  Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems , 1988, IEEE Trans. Computers.

[8]  Mikko H. Lipasti,et al.  Precise and Accurate Processor Simulation , 2002 .

[9]  Yale N. Patt,et al.  The effects of mispredicted-path execution on branch prediction structures , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.

[10]  BarrosoLuiz Andre,et al.  Memory system characterization of commercial workloads , 1998 .

[11]  A. Winsor Sampling techniques. , 2000, Nursing times.

[12]  David A. Wood,et al.  Full-system timing-first simulation , 2002, SIGMETRICS '02.

[13]  Brad Calder,et al.  Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[14]  David A. Wood,et al.  A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches , 1994, IEEE Trans. Computers.

[15]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[16]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[17]  Milo M. K. Martin,et al.  SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[18]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[19]  Ann Marie Grizzaffi Maynard,et al.  Contrasting characteristics and cache performance of technical and multi-user commercial workloads , 1994, ASPLOS VI.

[20]  Erik Hagersten,et al.  Memory Characterization of the ECperf Benchmark , 2003 .

[21]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[22]  Brad Calder,et al.  Time Varying Behavior of Programs , 1999 .

[23]  D. Patterson,et al.  Performance characterization of a quad Pentium Pro SMP using OLTP workloads , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[24]  Min Xu,et al.  Evaluating Non-deterministic Multi-threaded Commercial Workloads , 2001 .

[25]  Sandhya Dwarkadas,et al.  Execution-driven simulation of multiprocessors: address and timing analysis , 1994, TOMC.

[26]  Frederic T. Chong,et al.  HLS: combining statistical and symbolic simulation to guide microprocessor designs , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[27]  André Seznec,et al.  Choosing representative slices of program execution for microarchitecture simulations: a preliminary , 2000 .

[28]  Doug Burger,et al.  Measuring Experimental Error in Microprocessor Simulation , 2001, ISCA 2001.

[29]  Steven R. Kunkel,et al.  A multithreaded PowerPC processor for commercial servers , 2000, IBM J. Res. Dev..

[30]  Maged M. Michael,et al.  The design of COMPASS: an execution driven simulator for commercial applications running on shared memory multiprocessors , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[31]  Trevor N. Mudge,et al.  The YAGS branch prediction scheme , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[32]  Milo M. K. Martin,et al.  Bandwidth adaptive snooping , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[33]  A. Veidenbaum,et al.  The cedar system and an initial performance study , 1993, ISCA '93.

[34]  Ramendra K. Sahoo,et al.  MemorIES: a programmable, real-time hardware emulation tool for multiprocessor server design , 2000, SIGP.