Parallelization of nonuniform loops on distributed-memory supercomputers

A template algorithm is constructed for the parallel execution of the independent iterations of a loop on a multiprocessor computer with distributed memory. Regardless of the number of processors, the algorithm is required to use the available computing capacity efficiently when the iterations differ substantially in cost and/or the processors differ in performance. Interprocessor data communication and the control of the parallel computation are assumed to be implemented with the standard Message Passing Interface (MPI), which is widely used on such systems. Existing methods of loop parallelization are analyzed, and their efficiency is estimated empirically for various models of iteration nonuniformity.
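A common approach to the problem the abstract describes is dynamic (self-)scheduling: each processor requests the next unprocessed iteration as soon as it finishes its current one, so faster processors automatically take on more work. The sketch below illustrates this idea in shared-memory Python with a thread pool standing in for MPI workers; the synthetic cost model in `iteration` and all function names are assumptions for illustration, not the paper's algorithm.

```python
# Dynamic self-scheduling of a nonuniform loop: iterations are handed out
# one at a time as workers become free, so fast workers automatically
# process more iterations. This is a shared-memory stand-in for the MPI
# manager/worker pattern; in an MPI code the manager would send indices
# with MPI_Send and collect results with MPI_Recv.
import concurrent.futures
import time

def iteration(i):
    # Synthetic nonuniform work: cost varies with the iteration index.
    time.sleep(0.001 * (i % 7))
    return i * i

def run_loop(n_iters, n_workers):
    # Submitting each index to the pool individually yields dynamic
    # scheduling: no worker is assigned its next index until it is idle.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(iteration, range(n_iters)))

results = run_loop(20, 4)
```

The trade-off dynamic scheduling makes is one scheduling message per iteration; when iterations are cheap, methods that hand out chunks of iterations (static blocks, guided self-scheduling) reduce that overhead at the price of worse load balance.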