Summarizing multiprocessor program execution with versatile, microarchitecture-independent snapshots

Computer architects rely heavily on software simulation to evaluate, refine, and validate new designs before they are implemented. However, simulation time continues to increase as computers become more complex and multicore designs become more common. This thesis investigates software structures and algorithms for quickly simulating modern cache-coherent multiprocessors by amortizing the time spent to simulate the memory system and branch predictors. The Memory Timestamp Record (MTR) summarizes the directory and cache state of a multiprocessor system in a compact data structure. A single MTR snapshot is versatile enough to reconstruct the microarchitectural state resulting from various coherence protocols and cache organizations. The MTR may be quickly updated by each simulated processor during a fast-forwarding phase and optionally stored offline for reuse. To fill large branch prediction tables, we introduce Branch Predictor-based Compression (BPC) which compactly stores a branch trace so that it may be used to fill in any branch predictor structure. An entire BPC trace requires less space than single discrete predictor snapshots, and it may be decompressed 3-6 times faster than performing functional simulation.

[1]  Krste Asanovic,et al.  Accelerating Multiprocessor Simulation with a Memory Timestamp Record , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[2]  Louise Trevillyan,et al.  Representative traces for processor models with infinite cache , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[3]  Aleksandar Milenkovic,et al.  Demystifying Intel Branch Predictors , 2005 .

[4]  Luis Ceze,et al.  Full Circle: Simulating Linux Clusters on Linux Clusters , 2003 .

[5]  Amir Roth,et al.  DISE: a programmable macro engine for customizing applications , 2003, ISCA '03.

[6]  F. Petrini,et al.  The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[7]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[8]  Michael D. Smith,et al.  Tracing with Pixie , 1991 .

[9]  Ian H. Witten,et al.  Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..

[10]  Martin Burtscher,et al.  Compressing extended program traces using value predictors , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[11]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[12]  Alan Jay Smith,et al.  A class of compatible cache consistency protocols and their support by the IEEE futurebus , 1986, ISCA '86.

[13]  Douglas W. Clark,et al.  A Characterization of Processor Performance in the vax-11/780 , 1984, ISCA '84.

[14]  Alaa R. Alameldeen,et al.  Addressing Workload Variability in Architectural Simulations , 2003, IEEE Micro.

[15]  Gary Peterson,et al.  UltraSPARC-I , 1995, DAC '95.

[16]  S. McFarling Combining Branch Predictors , 1993 .

[17]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[18]  Babak Falsafi,et al.  ProtoFlex: Co-simulation for Component-wise FPGA Emulator Development , 2006 .

[19]  Thomas F. Wenisch,et al.  SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture , 2004, PERV.

[20]  Alan Jay Smith,et al.  Efficient Analysis of Caching Systems , 1987 .

[21]  Margaret Martonosi,et al.  Speculative Updates of Local and Global Branch History: A Quantitative Analysis , 2000, J. Instr. Level Parallelism.

[22]  David A. Wood,et al.  A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches , 1994, IEEE Trans. Computers.

[23]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[24]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[25]  Vivek Sarkar,et al.  Baring It All to Software: Raw Machines , 1997, Computer.

[26]  Jeanine Cook,et al.  Fast, Accurate Microarchitecture Simulation Using Statistical Phase Detection , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[27]  Brad Calder,et al.  A co-phase matrix to guide simultaneous multithreading simulation , 2004, IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004.

[28]  James Archibald,et al.  BACH: BYU Address Collection Hardware, The Collection of Complete Traces , 1992 .

[29]  Gabriel H. Loh Revisiting the performance impact of branch predictor latencies , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[30]  Lixin Zhang,et al.  Mambo: a full system simulator for the PowerPC architecture , 2004, PERV.

[31]  Thomas M. Conte,et al.  Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation , 1998, IEEE Trans. Computers.

[32]  Dmitry A. Shkarin,et al.  PPM: one step to practicality , 2002, Proceedings DCC 2002. Data Compression Conference.

[33]  Sarita V. Adve,et al.  Improving the accuracy vs. speed tradeoff for simulating shared-memory multiprocessors with ILP processors , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[34]  R. L. Sites,et al.  ATUM: a new technique for capturing address traces using microcode , 1986, ISCA '86.

[35]  Martin Burtscher,et al.  Automatic generation of high-performance trace compressors , 2005, International Symposium on Code Generation and Optimization.

[36]  James R. Goodman,et al.  Speculative lock elision: enabling highly concurrent multithreaded execution , 2001, MICRO.

[37]  Frederic T. Chong,et al.  HLS: combining statistical and symbolic simulation to guide microprocessor designs , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[38]  Lieven Eeckhout,et al.  Considering all starting points for simultaneous multithreading simulation , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[39]  Lieven Eeckhout,et al.  Efficient Sampling Startup for Sampled Processor Simulation , 2005, HiPEAC.

[40]  Larry Rudolph,et al.  Cooperative checkpointing: a robust approach to large-scale systems reliability , 2006, ICS '06.

[41]  Laxmikant V. Kalé,et al.  BigSim: a parallel simulator for performance prediction of extremely large parallel machines , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[42]  Eriko Nurvitadhi,et al.  Design, implementation, and verification of active cache emulator (ACE) , 2006, FPGA '06.

[43]  Thomas F. Wenisch,et al.  Simulation sampling with live-points , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[44]  Sandhya Dwarkadas,et al.  Efficient Simulation of Parallel Computer Systems , 1991, Int. J. Comput. Simul..

[45]  Andrew R. Cherenson,et al.  The Sprite network operating system , 1988, Computer.

[46]  Mark Horowitz,et al.  Cache performance of operating system and multiprogramming workloads , 1988, TOCS.

[47]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[48]  André Seznec,et al.  Choosing representative slices of program execution for microarchitecture simulations: a preliminary , 2000 .

[49]  Anant Agarwal,et al.  Blocking: exploiting spatial locality for trace compaction , 1990, SIGMETRICS '90.

[50]  Josep Torrellas,et al.  The Augmint multiprocessor simulation toolkit for Intel x86 architectures , 1996, Proceedings International Conference on Computer Design. VLSI in Computers and Processors.

[51]  Thomas F. Wenisch,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, ISCA '03.

[52]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[53]  Trevor N. Mudge,et al.  Intrinsic Checkpointing: A Methodology for Decreasing Simulation Time Through Binary Modification , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[54]  A. Dain Samples,et al.  Mache: no-loss trace compaction , 1989, SIGMETRICS '89.

[55]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[56]  M. Milenkovic,et al.  Exploiting streams in instruction and data address trace compression , 2003, 2003 IEEE International Conference on Communications (Cat. No.03CH37441).

[57]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[58]  Cong Fu,et al.  The RASE (Rapid, Accurate Simulation Environment) for chip multiprocessors , 2005, CARN.

[59]  Karel Driesen,et al.  The cascaded predictor: economical and adaptive branch target prediction , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[60]  G. Blelloch Introduction to Data Compression * , 2022 .

[61]  David R. Jefferson,et al.  Virtual time , 1985, ICPP.

[62]  James R. Larus,et al.  The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.

[63]  Daniel A. Jiménez,et al.  Dynamic branch prediction with perceptrons , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[64]  David A. Patterson,et al.  Computer Architecture - A Quantitative Approach, 5th Edition , 1996 .

[65]  Per Stenström,et al.  Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[66]  Peter K. Szwed,et al.  SimSnap: fast-forwarding via native execution and application-level checkpointing , 2004, Eighth Workshop on Interaction between Compilers and Computer Architectures, 2004. INTERACT-8 2004..

[67]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[68]  David A. Patterson,et al.  RAMP: research accelerator for multiple processors - a community vision for a shared experimental parallel HW/SW platform , 2006, ISPASS.

[69]  Mikko H. Lipasti,et al.  Redeeming IPC as a performance metric for multithreaded programs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[70]  William T. C. Kramer,et al.  Performance Variability of Highly Parallel Architectures , 2003, International Conference on Computational Science.

[71]  Ming Wang,et al.  Hardware emulation for functional verification of K5 , 1996, DAC '96.

[72]  Fredrik Larsson,et al.  SimGen: Development of Efficient Instruction Set Simulators , 1997 .

[73]  Daniel A. Jiménez Idealized Piecewise Linear Branch Prediction , 2005, J. Instr. Level Parallelism.

[74]  Kevin Skadron,et al.  Memory Reference Reuse Latency: Accelerated Sampled Microarchitecture Simulation , 2002 .

[75]  Derek Chiou,et al.  FPGA-based Fast , Cycle-Accurate , Full-System Simulators , 2006 .

[76]  Karel Driesen,et al.  Accurate indirect branch prediction , 1998, ISCA.

[77]  Thomas M. Conte,et al.  Reducing state loss for effective trace sampling of superscalar processors , 1996, Proceedings International Conference on Computer Design. VLSI in Computers and Processors.

[78]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[79]  Richard Uhlig,et al.  SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture , 1999 .

[80]  Alan Jay Smith,et al.  Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[81]  Mendel Rosenblum,et al.  Embra: fast and flexible machine simulation , 1996, SIGMETRICS '96.

[82]  Rabin A. Sugumar,et al.  Multi-configuration simulation algorithms for the evaluation of computer architecture designs , 1993 .

[83]  Anoop Gupta,et al.  Complete computer system simulation: the SimOS approach , 1995, IEEE Parallel Distributed Technol. Syst. Appl..

[84]  Michel Dubois,et al.  The Design of RPM: An FPGA-based Multiprocessor Emulator , 1995, Third International ACM Symposium on Field-Programmable Gate Arrays.

[85]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[86]  Andy D. Pimentel,et al.  Distributed simulation of multicomputer architectures with Mermaid , 1997 .

[87]  Peter K. Szwed,et al.  Application-level checkpointing for shared memory programs , 2004, ASPLOS XI.

[88]  Nathan L. Binkert,et al.  Network-Oriented Full-System Simulation using M5 , 2003 .

[89]  Yale N. Patt,et al.  Alternative implementations of two-level adaptive branch prediction , 1992, ISCA '92.

[90]  Xidong Wang,et al.  Lock Behavior Characterization of Commercial Workloads , 2002 .

[91]  Janak H. Patel,et al.  Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems , 1988, IEEE Trans. Computers.

[92]  James E. Smith,et al.  Statistical simulation of symmetric multiprocessor systems , 2002, Proceedings 35th Annual Simulation Symposium. SS 2002.

[93]  Brad Calder,et al.  Loop Termination Prediction , 2000, ISHPC.

[94]  Martin Burtscher,et al.  The VPC trace-compression algorithms , 2005, IEEE Transactions on Computers.

[95]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[96]  Laxmikant V. Kalé,et al.  NAMD: Biomolecular Simulation on Thousands of Processors , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[97]  Dan Tsafrir,et al.  System noise, OS clock ticks, and fine-grained parallel applications , 2005, ICS '05.

[98]  Jakob Engblom,et al.  A FULLY VIRTUAL MULTI-NODE 1553 BUS COMPUTER SYSTEM , 2006 .

[99]  J. Robert Jump,et al.  The rice parallel processing testbed , 1988, SIGMETRICS '88.

[100]  Pierre Michaud,et al.  A PPM-like, Tag-based Predictor. , 2005 .

[101]  D. Skinner,et al.  Understanding the causes of performance variability in HPC workloads , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[102]  Mikko H. Lipasti,et al.  Exceeding the dataflow limit via value prediction , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[103]  Trevor N. Mudge,et al.  Trace-driven memory simulation: a survey , 1997, CSUR.

[104]  Gary Lauterbach Accelerating architectural simulation by parallel execution of trace samples , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[105]  Margaret Martonosi,et al.  Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques , 1999, IEEE Trans. Computers.

[106]  André Seznec,et al.  Analysis of the O-GEometric history length branch predictor , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[107]  Xiangyu Zhang,et al.  Whole Execution Traces , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[108]  A. J. KleinOsowski,et al.  MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research , 2002, IEEE Computer Architecture Letters.

[109]  Krste Asanovic,et al.  Branch trace compression for snapshot-based simulation , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[110]  Janak H. Patel,et al.  A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[111]  M. Valero,et al.  A Novel Evaluation Methodology to Obtain Fair Measurements in Multithreaded Architectures , 2006 .