Simultaneous Parallel Reduction on SIMD Machines

Properly distributing operations among the processors of a distributed-memory machine can significantly reduce the total running time of a large scientific computation. In this paper, we propose an operation, called simultaneous parallel reduction (SPR), that is amenable to such optimization. SPR performs many reduction operations in parallel, each reducing a one-dimensional section of consecutive elements of a distributed array. Because the sections overlap, each array element serves as an operand to many concurrently executed reductions. SPR is thus distinct from the more commonly considered parallel reduction, which evaluates a single reduction concurrently. We consider SPR on Single Instruction Multiple Data (SIMD) machines with different interconnection networks, focusing on sections whose size is not a power of 2 and on results that are shifted relative to the arguments. Several algorithms achieving some of the lower bounds on SPR complexity are presented...
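
To make the definition of SPR concrete, the following sequential Python sketch may help. It is an illustration only, not one of the algorithms presented in the paper. It assumes that every section has the same size k and that result i reduces elements i through i+k-1; the function names spr_reference and spr_doubling are hypothetical. The first function applies the definition directly. The second uses a shift-and-combine doubling scheme, in which each list comprehension stands in for one data-parallel step (shift the array, then apply the operator elementwise), and handles a section size that is not a power of 2 by combining the partial windows selected by the binary digits of k. The result shift mentioned in the abstract is not modeled here.

```python
from operator import add


def spr_reference(a, k, op=add):
    """Direct definition: result i reduces the consecutive section a[i:i+k]."""
    out = []
    for i in range(len(a) - k + 1):
        acc = a[i]
        for x in a[i + 1:i + k]:
            acc = op(acc, x)
        out.append(acc)
    return out


def spr_doubling(a, k, op=add):
    """Same results via shift-and-combine steps; op must be associative.

    Illustrative assumption: each list comprehension models one
    data-parallel step (shift by a fixed offset, then elementwise op).
    """
    n = len(a)
    if not 1 <= k <= n:
        raise ValueError("section size k must satisfy 1 <= k <= len(a)")
    win, width = list(a), 1    # win[i] reduces a[i:i+width]
    res, covered = None, 0     # res[i] reduces a[i:i+covered]
    remaining = k
    while remaining:
        if remaining & 1:
            if res is None:
                res = list(win)
            else:
                # Append the window that starts right after the covered part.
                res = [op(res[i], win[i + covered])
                       for i in range(n - covered - width + 1)]
            covered += width
        remaining >>= 1
        if remaining:
            # Double the window: combine each window with its right neighbor.
            win = [op(win[i], win[i + width])
                   for i in range(n - 2 * width + 1)]
            width *= 2
    return res
```

For example, with a = [1, 2, 3, 4, 5, 6, 7] and k = 3, both functions return the five overlapping section sums [6, 9, 12, 15, 18]. The doubling version needs about 2 log2(k) combine steps rather than k - 1, and it preserves operand order, so it also works for associative operators that are not commutative.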