The TRANSPOSE machine: a global implementation of a parallel graph reducer

A new concept is described for the parallel implementation of functional languages on a network of processors. The implementation uses a special variant of annotated graph reduction. Active waiting is employed to avoid complicated runtime data structures. To overcome the problems of load balancing and scheduling, a global address space is used, with the graph nodes distributed at random over the local memories of the processors. The reduction is organized in cycles, during each of which all annotated redices are reduced. This notion of 'cycles' permits the authors to restrict communication between the processors to the execution of a global permutation, defined by an array of messages. This 2-D permutation is realized by a simple and fast algorithm, which maps any 2-D permutation to a double 2-D transpose operation. Hence the implementation can be used on any network topology that supports the transpose operation (namely shuffle-exchange). The potential speedup of graph reduction programs is compared with the overhead of the implementation, giving deeper insight into parallel graph reduction.
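
To illustrate why a transpose can carry interprocessor communication, the following minimal sketch models the messages of one cycle as a P x P array in which entry (src, dst) is the message that processor src sends to processor dst. Transposing that array delivers every message, and a second transpose routes the replies back. This is only an illustration under stated assumptions, not the algorithm of the paper: the names `transpose` and `reply_to`, the one-slot-per-processor-pair layout, and the request/reply reading of the two transposes are all hypothetical.

```python
from typing import List, Optional

Message = Optional[str]          # None marks an empty message slot
Grid = List[List[Message]]       # grid[row][col], one row per processor


def transpose(grid: Grid) -> Grid:
    """Return the transpose of a square message array.

    If row `src` is the outbox of processor `src`, then row `dst` of the
    result is the inbox of processor `dst`: result[dst][src] == grid[src][dst].
    """
    p = len(grid)
    return [[grid[src][dst] for src in range(p)] for dst in range(p)]


def reply_to(inboxes: Grid) -> Grid:
    """Hypothetical reply step: answer every received message in place."""
    return [[None if m is None else "re:" + m for m in row] for row in inboxes]


# Assumed example: three processors, at most one message per (src, dst) pair.
outboxes: Grid = [
    [None,   "0->1", "0->2"],
    ["1->0", None,   None],
    ["2->0", "2->1", None],
]

inboxes = transpose(outboxes)           # first transpose: deliver requests
answers = transpose(reply_to(inboxes))  # second transpose: return replies
```

After the first transpose, `inboxes[1]` holds the messages addressed to processor 1, indexed by sender. After the second, `answers[0][1]` holds the reply to the message that processor 0 sent to processor 1.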