Paper information - Parallel execution of radix sort program using fine-grain communication

The report presents empirical results on fine-grain communication on the 80-processor EM-X distributed-memory multiprocessor. EM-X has hardware support for low-latency, high-throughput fine-grain communication: packet generation integrated into the instruction execution pipeline for single-cycle communication overhead, direct memory access for remote references, and rapid context switching for latency tolerance. The authors study the fine-grain communication performance of integer radix sort, a code with irregular communication, on EM-X and compare it with the Fujitsu AP1000+ and the Cray Server CS6400. The experimental results indicate that EM-X achieves high throughput and low overhead for fine-grain communication. Whereas EM-X's communication performance scales perfectly as the number of processors increases, other coarse-grain message-passing machines exhibit fluctuation and performance degradation in larger configurations due to network contention.
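To illustrate why integer radix sort produces irregular, fine-grain communication, here is a minimal single-process sketch in Python. It is not the authors' EM-X program: the function name `parallel_radix_sort`, the list-of-blocks representation of processors, and the `remote_writes` counter are assumptions made only for illustration. Each simulated processor histograms its local keys, a global prefix sum assigns destination positions, and every key is then written individually to whichever processor owns its destination slot, which models one small remote write per key.

```python
from typing import List, Tuple


def parallel_radix_sort(
    blocks: List[List[int]], key_bits: int = 16, radix_bits: int = 4
) -> Tuple[List[List[int]], int]:
    """Simulate an LSD radix sort over len(blocks) hypothetical processors.

    Keys must be non-negative and fit in key_bits; all blocks must have the
    same length. Returns (sorted blocks, count of one-key remote writes).
    """
    p = len(blocks)
    n_per = len(blocks[0]) if p else 0
    assert all(len(b) == n_per for b in blocks)
    buckets = 1 << radix_bits
    mask = buckets - 1
    remote_writes = 0

    for shift in range(0, key_bits, radix_bits):
        # Phase 1: each processor histograms the current digit of its keys.
        hist = [[0] * buckets for _ in range(p)]
        for pe, block in enumerate(blocks):
            for key in block:
                hist[pe][(key >> shift) & mask] += 1

        # Phase 2: global exclusive prefix sum, ordered by digit and then by
        # processor, so the pass is stable across processors.
        offset = [[0] * buckets for _ in range(p)]
        running = 0
        for d in range(buckets):
            for pe in range(p):
                offset[pe][d] = running
                running += hist[pe][d]

        # Phase 3: every key goes individually to the owner of its global
        # position; the destination depends on the data (irregular traffic).
        new_blocks = [[0] * n_per for _ in range(p)]
        for pe, block in enumerate(blocks):
            for key in block:
                d = (key >> shift) & mask
                pos = offset[pe][d]
                offset[pe][d] += 1
                dest, idx = divmod(pos, n_per)
                new_blocks[dest][idx] = key
                if dest != pe:
                    remote_writes += 1  # models one fine-grain remote write
        blocks = new_blocks

    return blocks, remote_writes
```

A call such as `parallel_radix_sort([[5, 3, 9, 1], [8, 2, 7, 0]], key_bits=4, radix_bits=2)` returns the keys in globally sorted order across the two blocks, together with the number of keys that crossed processors. On a coarse-grain machine those per-key writes would typically be batched into larger messages; the sketch keeps them separate to show the access pattern the abstract describes.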
