Accelerating Multi-threaded Application Simulation through Barrier-Interval Time-Parallelism

In the last decade, the microprocessor industry has undergone a dramatic change, ushering in the new era of multi-/manycore processors. As new designs incorporate increasing core counts, simulation technology has not kept pace, resulting in simulation times that increasingly dominate the design cycle. Complexities associated with the execution of code and communication between simulated cores have presented new obstacles for the simulation of manycore designs. Hence, many techniques developed to accelerate uniprocessor simulation cannot be easily adapted to accelerate manycore simulation. In this work, a novel time-parallel barrier-interval simulation methodology is presented to accelerate the simulation of certain classes of multi-threaded workloads. A program delineated into intervals by barriers may be accurately simulated with those intervals running in parallel. This approach avoids challenges originating from unknown thread progressions, since the program location of each executing thread is known at a barrier. For the workloads tested, wall-clock speedups range from 1.22× to 596×, with an average of 13.94×. Furthermore, this approach allows the estimation of stable performance metrics such as cycle counts with minimal losses in accuracy (2%, on average, for all tested workloads). The proposed technique provides a fast and accurate mechanism for accelerating particular classes of manycore simulations.
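The idea of simulating barrier-delimited intervals in parallel can be illustrated with a minimal Python sketch. It is not the authors' simulator. The function names, the fixed cycles-per-instruction cost model, and the sample instruction counts are all hypothetical, and the sketch ignores how each interval's starting architectural and microarchitectural state would be obtained. It only shows the structure: each interval is simulated independently, and the per-interval cycle counts are summed to estimate the cycle count of the whole program.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fixed cost model; a real simulator would model the pipeline,
# caches, and interconnect in detail.
CYCLES_PER_INSTRUCTION = 2


def simulate_interval(interval):
    """Toy stand-in for a detailed simulation of one barrier interval.

    `interval` is a list of per-thread instruction counts between two
    consecutive barriers. All threads wait at the closing barrier, so the
    slowest thread determines the length of the interval in cycles.
    """
    return max(n * CYCLES_PER_INSTRUCTION for n in interval)


def simulate_time_parallel(intervals, workers=4):
    """Simulate every barrier interval concurrently and sum the cycles."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the intervals in its results.
        per_interval = list(pool.map(simulate_interval, intervals))
    return per_interval, sum(per_interval)


# Hypothetical workload: three barrier intervals, three threads each.
intervals = [[100, 120, 90], [50, 50, 60], [200, 180, 210]]
per_interval_cycles, total_cycles = simulate_time_parallel(intervals)
print(per_interval_cycles, total_cycles)
```

A thread pool keeps the sketch short and self-contained. Detailed simulations are CPU-bound, so a practical setup would more plausibly run each interval in its own process or on a separate machine.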
