The Influence of Parallel Decomposition Strategies on the Performance of Multiprocessor Systems

We present a model for predicting multiprocessor performance on iterative algorithms, where each iteration consists of some amount of access to global data and some amount of local processing. The application cycles may be synchronous or asynchronous, and the processors may or may not incur waiting time, depending on the relationship between the access time and processing time. The amount of processing time and global data accesses incurred by the parallel processes depends upon characteristics of the algorithm and its decomposition. We study the decompositions of several sample algorithms, and identify several decomposition groups. Finally, using the Poisson partial-differential equation algorithm as an example, we investigate how its decomposition affects its performance.

1 Introduction

The value of a computer system is measured by the work it can perform. Parallel processing opens a new horizon for handling large workloads, but its promise can only be realized if multiprocessors can be built and programmed to take full advantage of the potential parallelism. The computer-system designer must choose from a wide range of design alternatives, far too many to test empirically. Good performance models are needed to provide a starting place for empirical investigation.

In an earlier paper [1], we presented a model which measures the cyclic processing power that a multiprocessor can apply to a workload with given characteristics. Unlike other models [2, 3, 4, 5], it is not based on predicting statistical mean values for performance over some time interval. Rather, it is based on the fact that most algorithms are iterative in nature, and the iterations tend to perform approximately the same amount of processing and data access.

Let us assume that a given algorithm consumes $T_p$ units of processing time each iteration, and spends $T_a$ time units accessing global data.
The ratio of processing-to-access time depends upon the algorithm, and will be denoted by

$$X = \frac{T_p}{T_a}$$

When an algorithm is decomposed into parallel processes, an iteration by a subprocess usually takes less time than an iteration in its uniprocessor counterpart. Let $t_p$ and $t_a$ be the processing and access times within a subprocess iteration. If we let $N > 1$ denote the number of processors engaged in a parallel decomposition, usually $t_p < T_p$ and $t_a < T_a$, but often it is not true that $N t_p = T_p$ or $N t_a = T_a$. The change in $t_p$ and $t_a$ as processors are added is also characteristic of the algorithm. Let us define the decomposition functions $f_p$ and $f_a$ as

$$f_p = \frac{T_p}{t_p}, \qquad f_a = \frac{T_a}{t_a}$$

The time for a single iteration of a uniprocessor algorithm is simply $T_c = T_p + T_a$. However, due to contention for global data, the iteration time for a subprocess in a parallel implementation also depends on a nonnegative waiting time $t_w$. Thus, the cycle time is $t_c = t_p + t_a + t_w$.

Both the decomposition functions and the waiting time influence the speed of a multiprocessor implementation, usually preventing an $N$-processor decomposition from finishing in $1/N$th the time needed by the uniprocessor version. We shall define the speedup SP as a ratio of cycle times: