On the design and analysis of practical combinatorial algorithms for multiprocessor architectures

This dissertation presents a framework for the design and evaluation of high-level, platform-independent algorithms that execute efficiently across a variety of multiprocessor architectures. Our framework couples a computational model for clusters of high-performance nodes with rigorous empirical validation on a representative variety of current multiprocessors using carefully chosen benchmarks. This framework has enabled us to develop optimal algorithms for h-relations, sorting, and prefix computations for various subsets of this general cluster architecture. These problems were selected because of their substantial and irregular communication and memory access requirements. To solve these problems efficiently, we introduce a number of novel techniques. For example, we show how the irregular global communication that is endemic to a large class of problems can be implemented more efficiently using either a deterministic or a randomized scheme that requires only two rounds of balanced communication. We then show how the statistical properties generated by the first stages of these regular rearrangements can be leveraged to produce more effective algorithms. In another important result, we demonstrate that attention to the number and type of main memory accesses, as opposed to a more traditional focus on computational complexity, is crucial to distinguishing algorithmic choices on hierarchical memory platforms. Achieving efficient results on these platforms requires a number of algorithmic modifications that are typically overlooked. Each of our algorithms was implemented in a high-level language and run on a representative selection of platforms. In each case, the algorithm was tested using a suite of benchmarks specifically created to examine the performance issues identified by our computational model, including the volume and pattern of memory accesses and interprocessor communication.
The results confirmed the importance of these issues and the validity and scalability of our solutions. Moreover, wherever possible, we confirmed the efficiency of our algorithms by comparing our experimental results with the best available data for alternative algorithms.
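
As an illustration of the two-round idea mentioned above, the following is a minimal single-process sketch, not the dissertation's actual algorithm. It assumes p simulated processors, each holding a list of (destination, payload) messages. In the first round every message goes to an intermediate processor chosen without regard to the skew of the destination pattern. In the second round each intermediate forwards what it holds to the true destination. The function name `two_round_route`, its parameters, and the cyclic dealing rule are hypothetical choices made for this example.

```python
import random


def two_round_route(outboxes, randomized=False, seed=0):
    """Simulate routing messages in two rounds through intermediate processors.

    outboxes[i] is the list of (dest, payload) pairs held by processor i.
    Returns (inboxes, staging): inboxes[d] lists the payloads delivered to d,
    and staging[j] lists the messages held by intermediate j after round 1.
    This is an illustrative sketch under assumed rules, not a tuned algorithm.
    """
    p = len(outboxes)
    rng = random.Random(seed)

    # Round 1: spread each processor's messages over the intermediates.
    staging = [[] for _ in range(p)]
    for src, msgs in enumerate(outboxes):
        if randomized:
            # Randomized variant: a uniformly random intermediate per message.
            for msg in msgs:
                staging[rng.randrange(p)].append(msg)
        else:
            # Deterministic variant (assumed rule): group by destination,
            # then deal the messages cyclically, starting at the sender's index.
            ordered = sorted(msgs, key=lambda m: m[0])
            for k, msg in enumerate(ordered):
                staging[(src + k) % p].append(msg)

    # Round 2: each intermediate forwards its messages to the real destination.
    inboxes = [[] for _ in range(p)]
    for msgs in staging:
        for dest, payload in msgs:
            inboxes[dest].append(payload)
    return inboxes, staging
```

For example, if four processors each send eight messages to processor 0, the deterministic variant stages exactly eight messages on every intermediate, so in the second round each intermediate forwards an equal share to the destination. The randomized variant achieves this balance only in expectation.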