SIMULATION FOR THE EPPP PROJECT

The Environment for Portable Parallel Programming (EPPP) project aims to empower parallel programmers to write, with relative ease, applications that can be efficiently executed on various distinctly different computing platforms. One part of this ambitious undertaking is the provision of a general-purpose simulator which can simulate various parallel machines on conventional workstations. Targeted architectures for simulation include an Alex Informatique AVX series 1 (a transputer system), a DECmpp 12000 (a SIMD machine) and a network of IBM RS/6000s (a MIMD parallel computer). A major goal of this simulator is to provide adequately accurate results within acceptable response times. We hope that this goal can be achieved through a technique called compiled simulation, in which the application code is augmented with cycle-counting instructions and routines for maintaining a global time ordering between processes, before being compiled on the host machine for simulation. A major difference between our use of the compiled simulation technique and its use in extant simulators is that the architecture to be simulated need not have an instruction set similar to that of the host machine. Such complications will be addressed in this paper.

Introduction

Writing programs to be executed on a parallel computer is a non-trivial task that is generally more error-prone than writing code for a uniprocessor machine; the coordination of parallel processes working in concert to solve a single problem adds an extra dimension to the programmer's concerns. Coordination may include synchronization, scheduling and resource management, all of which may require interprocessor communication. Furthermore, expected performance gains over sequential execution may not be realized due to the programmer's lack of knowledge of the underlying parallel execution platform.
Lastly, once the programmer has fully debugged and tuned a parallel program for a particular target machine, the program becomes non-portable; that is, porting the program to another distinctly different parallel machine may require substantial rewrites due to the machine-specific optimizations the programmer has introduced. The Environment for Portable Parallel Programming (EPPP) project, currently being conducted at the Centre de recherche informatique de Montréal (CRIM), McGill University, Concordia University and the Université de Montréal, attempts to provide an environment in which a programmer can write, debug and tune a parallel program with relative ease. The environment will be portable in the sense that the user will be able to port an application to a variety of parallel architectures within the same development framework, and the environment itself can be hosted on conventional engineering workstations such as SPARC, RS/6000 and other platforms. In the EPPP, a program specified in a high-level language will be combined with application-domain-specific and architecture-specific information in the compilation phase to produce object code for the target machine. Such information can either be supplied by library routines and information tables provided in the environment, or be specified by the user. Target parallel architectures of our environment include an Alex Informatique AVX transputer system, a DECmpp 12000 SIMD computer and a workstation farm from IBM which consists of RS/6000 nodes. In the simulation tool of EPPP, a high-level program is first compiled with the target architecture in mind, i.e., all optimizations pertinent to the target architecture are performed on the code. This allows for a more accurate estimation of the cost of each instruction's execution.
These estimates are incorporated into appropriate locations in the program – basically, cycle-counting instructions and management code for regulating the simulated execution of parallel code are inserted – before code for the simulation host is produced. The code for the simulation host is then executed to obtain performance results. In this paper, we concentrate on the simulation aspects of the EPPP project. First, we briefly review some existing simulators; in particular, we examine the Proteus simulator [1, 2, 3] from MIT in detail. Then, we give more details of our simulator, which is based on the Proteus framework. Our simulator advances the Proteus work by allowing architectures with different instruction sets to be simulated more accurately. Furthermore, our simulator is tightly integrated into our environment, providing functional and performance debugging support. Finally, we conclude this paper by outlining our future work.

Background

Conventional simulators are generally based on a cycle-by-cycle simulation of an architecture where instructions of the target machine are simulated at the functional-block level. Although they are very accurate, and are thus useful in predicting the performance of novel architectures, they have proved less useful in code development because of their generally slow response times. One way to improve the speed is to no longer simulate the target machine instruction by instruction, but to execute a compiled version of the code directly on the simulation host and time it with the host's clock [4, 5]. The problem with this approach is the accuracy and granularity of most workstation clocks. For example, times are often truncated to the nearest millisecond. Within a millisecond, thousands of instructions may have executed, or tens to hundreds of interprocessor communication messages may have occurred.
Another problem is that the code added to control the simulation, code that would not be executed on the actual target architecture, is also timed and affects the overall execution time. For these reasons, timing with the simulation host's clock can only give a rough estimate of the actual run times on the target architecture. Another approach is to use code augmentation. Augmentation was first used by Threads [6], refined in RPPT [7], and perfected in Tango [8] and Proteus. Augmentation will be described in more detail in the next section. Tango and Proteus were developed independently and have come to be very similar in nature. In our project, we have chosen Proteus as a starting point because Tango uses Unix processes to simulate parallel execution while Proteus uses faster, lightweight processes managed by the simulation engine. The difference in simulation time is significant. However, there is now a new version of Tango which has lightweight processes, but we will not examine it in this paper.

PROTEUS

Proteus multiplexes a single processor to simulate parallel architectures. The simulator is composed of three main components. First, a user interface is available to select the architecture to be simulated. Proteus can simulate a wide range of MIMD architectures. The interface is also used to debug the program at execution time, and to visualize the results and behavior of an execution. The user can analyze processor utilization, communications and code profiles (timing of procedures, number of calls, etc.). The second component is responsible for augmenting the compiled code. Once the architecture is specified, augmentation is used to add code to the user's application to time and guide the simulation. The user code is first compiled into the simulation host's assembly language; then, it is divided into basic blocks. A basic block is delimited either by a jump or call instruction, or by an instruction to which others can branch.
Inside every block, each instruction is assigned a cycle count – the number of processor cycles it would take to execute the instruction on the target processor – via a table look-up mechanism. The cycle counts of the instructions are then summed, and an instruction updating a cycle counter is added at the end of the block. Finally, a conditional jump to the simulation engine is also added at the end of the basic block so that the execution order of the blocks can be maintained; blocks in a parallel machine execute concurrently, and a global ordering of all block executions is maintained by the simulation engine. That is, after each execution of a basic block, the simulation engine checks which block should be executed next by examining the calculated starting times of all the executable blocks and selecting the block with the lowest starting time. To limit the number of context switches to another executable block, jumping to the simulator is not done after every basic block, but only conditionally, after a minimum amount of time has elapsed since the last context switch. In the code augmentation stage, a directive is also available to disable augmentation for code that one does not want cycle counted; this is useful for adding non-intrusive monitoring and debugging code. The last and major component of Proteus is the simulation engine. When the user selects a specific MIMD architecture, he essentially sets parameters of the engine. A specific engine is compiled and then linked with the user's application for the selected architecture. The final program interleaves the execution of the user's application with the simulator. When a user block finishes execution, it updates the cycle counter of its simulated processor and gives control to the simulator. The simulator then executes the available block with the earliest starting time according to the global simulated time. Interprocessor communications are also handled by the engine.
When the user program performs a communication, it actually calls the simulator, which 'routes' the message to the proper processor. The destination processor will receive the message at a global time determined by the simulator according to the user-defined architecture. For even faster simulation of communications, the simulator can optionally use a mathematical model to determine the time needed to route the message. Proteus has