Paper information - StarT-jr: a parallel system from commodity technology

StarT-jr: a parallel system from commodity technology

StarT-jr is an experimental parallel system composed of a network of personal computers (PCs). The system leverages the momentum of the microprocessor and PC industries to achieve excellent single-node performance at a low cost. For parallel processing, StarT-jr uses the Flexible User-level Network Interface (FUNi) to provide low-overhead, user-level interprocessor communication over two IEEE High Performance Serial Busses. This efficient message-passing mechanism enables StarT-jr to exploit fine-grained parallelism for good parallel performance.

FUNi is based on an embedded processing system on a PCI card. Custom network hardware assembled from a commercial IEEE chip set provides FUNi with access to the IEEE network. In message passing, FUNi's embedded processor serves as a network coprocessor and manages a user-accessible message-passing interface in the host memory. User-level applications directly manipulate the interface location in host memory using cached reads and writes, so costly physical I/O accesses to device registers on the PCI bus are avoided. Currently, FUNi can efficiently support both fine-grain message passing and direct memory-to-memory transfers of large data blocks. FUNi can also support globally coherent shared memory by capturing and responding to memory accesses within a designated global address range. FUNi maintains a globally coherent shared-memory cache to minimize global memory access latency. The necessary coherence protocol processing and communication are performed by the FUNi coprocessor.

We have demonstrated a two-node prototype of StarT-jr and are awaiting fabrication of additional interface cards in order to assemble an eight-node system. StarT-jr currently supports an active-message-based, lightweight communication library for the C programming language. Preliminary measurements of the communication library demonstrated overheads of [...] sec for sending or receiving small ([...] bytes) messages and a user-to-user latency of [...] sec. Direct memory-to-memory transfers can sustain [...] MByte/sec on an unloaded network. With regard to shared-memory operation, reading a shared memory location cached in FUNi takes approximately [...] sec.
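The message-passing path described in the abstract (an application writes messages into queues in host memory, a coprocessor drains them onto the network, and the receiver dispatches them to handlers) can be illustrated with a toy model. This is only a sketch under stated assumptions: the names `MessageQueue`, `Node`, and `coprocessor_step` are hypothetical and are not FUNi's actual interface, the queues are ordinary Python lists rather than cached host memory, and the network is reduced to a direct copy between two nodes.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A message is (handler id, payload), in the spirit of active messages.
Message = Tuple[int, bytes]


class MessageQueue:
    """Fixed-size ring buffer standing in for a queue kept in host memory."""

    def __init__(self, slots: int) -> None:
        self.slots: List[Optional[Message]] = [None] * slots
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer writes

    def is_empty(self) -> bool:
        return self.head == self.tail

    def is_full(self) -> bool:
        # One slot is left unused so that full and empty are distinguishable.
        return (self.tail + 1) % len(self.slots) == self.head

    def enqueue(self, msg: Message) -> bool:
        if self.is_full():
            return False
        self.slots[self.tail] = msg
        self.tail = (self.tail + 1) % len(self.slots)
        return True

    def dequeue(self) -> Optional[Message]:
        if self.is_empty():
            return None
        msg = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return msg


class Node:
    """Hypothetical host: a transmit queue, a receive queue, and handlers."""

    def __init__(self, slots: int = 8) -> None:
        self.tx = MessageQueue(slots)
        self.rx = MessageQueue(slots)
        self.handlers: Dict[int, Callable[[bytes], None]] = {}

    def send(self, handler_id: int, payload: bytes) -> bool:
        # The application only touches its own queue; no device access here.
        return self.tx.enqueue((handler_id, payload))

    def poll(self) -> int:
        """Run the handler for every pending message; return how many ran."""
        count = 0
        while not self.rx.is_empty():
            handler_id, payload = self.rx.dequeue()
            self.handlers[handler_id](payload)
            count += 1
        return count


def coprocessor_step(src: Node, dst: Node) -> int:
    """Stand-in for coprocessor plus network: move src.tx into dst.rx."""
    moved = 0
    while not src.tx.is_empty() and not dst.rx.is_full():
        dst.rx.enqueue(src.tx.dequeue())
        moved += 1
    return moved


def demo() -> List[bytes]:
    a, b = Node(), Node()
    received: List[bytes] = []
    b.handlers[1] = received.append  # handler 1 records the payload
    a.send(1, b"ping")
    coprocessor_step(a, b)
    b.poll()
    return received
```

Calling `demo()` sends one message from node `a` to node `b` and returns the payloads that `b`'s handler recorded. The point of the design is that `send` and `poll` operate only on in-memory queues, while the transfer itself is delegated to a separate agent; the sketch says nothing about the real system's timing or queue layout.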

Michael S. Ehrlich | M. Ehrlich
