Optimization of Parallel Discrete Event Simulator for Multi-core Systems

Parallel Discrete Event Simulation (PDES) can substantially improve the performance and capacity of simulation, allowing larger, more detailed models to be studied in less time. PDES is a fine-grained parallel application whose performance and scalability are limited by communication latencies. Traditionally, PDES simulation kernels use processes that communicate by message passing; shared memory is used only to optimize message passing between processes running on the same machine. We report on our experience implementing a thread-based version of the ROSS simulator. The multithreaded implementation eliminates multiple message copies and significantly reduces synchronization delays. We study the performance of the simulator on two hardware platforms: a Core i7 machine and a 48-core AMD Opteron Magny-Cours system. We identify performance bottlenecks, then propose and evaluate mechanisms to overcome them. Results show that the multithreaded implementation improves performance over the MPI version by up to a factor of 3 on the Core i7 machine and by up to a factor of 1.2 on the Magny-Cours system for 48-way simulation.
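The difference between the two delivery styles can be illustrated with a minimal sketch. This is not ROSS code: the `Event` class, the header layout, and both delivery functions are hypothetical names chosen for illustration. The process-style path packs the event into a buffer and rebuilds it at the receiver, while the thread-style path hands over a reference through a thread-safe queue, so the receiver sees the very same object.

```python
import queue
import struct
import threading

# Assumed wire header: receive timestamp (float64) and destination LP id (uint32).
HEADER = struct.Struct("<dI")


class Event:
    """Hypothetical event record, not the ROSS event structure."""

    def __init__(self, recv_ts, dest_lp, payload):
        self.recv_ts = recv_ts
        self.dest_lp = dest_lp
        self.payload = payload


def deliver_by_message(event):
    """Process-style delivery: pack at the sender, unpack at the receiver."""
    wire = HEADER.pack(event.recv_ts, event.dest_lp) + event.payload  # copy into send buffer
    recv_ts, dest_lp = HEADER.unpack_from(wire)
    return Event(recv_ts, dest_lp, wire[HEADER.size:])  # copy into a new event


def deliver_by_shared_queue(event):
    """Thread-style delivery: enqueue a reference on the receiver's inbox."""
    inbox = queue.Queue()  # thread-safe queue; synchronizes internally
    received = []

    def receiver():
        received.append(inbox.get())  # blocks until the sender enqueues

    worker = threading.Thread(target=receiver)
    worker.start()
    inbox.put(event)  # only the reference crosses threads
    worker.join()
    return received[0]


original = Event(recv_ts=12.5, dest_lp=7, payload=b"x" * 1024)
copied = deliver_by_message(original)
shared = deliver_by_shared_queue(original)
```

Here `copied` is a distinct object with equal field values, whereas `shared` is the original event itself. The sketch shows only where the copies come from; it does not model rollback, GVT computation, or the synchronization costs discussed in the paper.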
