Performance of All-to-All on QsNetII

In this paper we study the performance of the all-to-all collective on QsNetII. We compare the effectiveness of a range of algorithms on both small and large element sizes. Results from 1024-node clusters show low latency, a high percentage of peak bandwidth and good scalability.

Introduction

Many scientific applications exhibit the need for communication patterns that involve global data movement and global control [1]. One frequently used collective is the all-to-all message exchange, in which each of p processes transmits a distinct message to itself and to each of the other p-1 processes. All-to-all communication is used in the fast Fourier transform (FFT) and in matrix transpose. Optimisation of these operations can substantially improve the performance and the resource utilization of a parallel computer.

QsNetII is the latest generation of Quadrics interconnect [2, 3]. It consists of two ASICs, Elan4 and Elite4. The Elan4 communication processor forms the interface between a processing node and a high performance multistage network. It has a 64-bit internal architecture and supports 64-bit virtual addresses. The Elan4 generates and accepts packets to and from the network. In addition, it provides local processing power to implement the high-level message passing protocols required in parallel processing.

The network is constructed from Elite4 switch components that are capable of switching eight bi-directional communications links. Each communications link carries data in both directions simultaneously at 1.3 Gbytes/sec. The link bandwidth is shared between two virtual channels. The network supports point-to-point transfer between arbitrary nodes and broadcast across selected ranges of nodes.

QsNetII

The main features of QsNetII are low latency, high bandwidth, scalability, reliable transmission and a commodity host adapter interface. Many scientific applications are very sensitive to the MPI [4] communication latency.
The Elan adapter minimizes this latency by providing specialized units that quickly pipeline small messages into the network, perform protocol processing and notify the completion of the communication primitive. Its pipelined DMA engine uses split transaction reads to maximize host adapter bandwidth.

QsNetII is designed to scale to thousands of nodes, both in terms of hardware capability and system software design. It is used in Thunder at Lawrence Livermore National Lab (the fifth most powerful supercomputer in the world at the time of writing [5]) and mpp2 at Pacific Northwest National Lab, both 1024-node systems with IA64 CPUs. QsNetII implements a reliable transmission protocol in hardware, and is able to detect faults, route packets around faulty switches and re-transmit packets in the presence of data errors.

Performance of all-to-all on QsNetII, 24 April 2005

Figure 1: Elan4 Functional Units

The Elan4 network adapter contains the following functional units, as outlined in Figure 1.

• A 1.3 Gbyte/sec each way network interface connection.
• A pipelined DMA engine.
• A 64-bit microprocessor, the Thread Processor.
• An MMU used to translate 64-bit virtual addresses into either local SDRAM physical addresses or 64-bit physical addresses for PCI-X master address.
• A 64-bit pipelined SDRAM interface with full ECC logic, connected to 64 Mbytes of adapter SDRAM.
• A 32KB 4-way set associative cache with multiple read and write ports.
• A command processor that defines a virtual command queue interface. This provides a virtual interface to the programmer, giving the freedom to define multiple, independent, low latency command queues.
• A short message-processing unit called the STEN (Short Transaction Engine).
• A PCI-X, 64-bit, 133 MHz host interface.

User processes can perform remote read/write memory operations by issuing DMA commands to the Elan4. The DMA engine services a queue of outstanding DMA requests and ensures that they are completed reliably.
The DMA engine can handle arbitrary source and destination buffer alignment as well as endian conversions. In addition, there are facilities to issue broadcasts and queued DMAs. The DMA engine processes 2 DMAs concurrently to overlap the startup/finish latency and maintain full PCI-X read bandwidth.

The Short Transaction Engine (or STEN) generates network packets directly. The main CPU or the Elan thread processor can, for example, initiate a single-word remote put simply by writing the address, data value and destination virtual process id to a command queue. The STEN will then manage the transfer. The STEN is used for short transfers where the overhead of writing a DMA descriptor across the PCI bus is high. Its use results in increased issue rates for short puts and gets.

The completion of a data transfer (DMA or STEN packet) can be signalled by setting an event in both the source and destination processes. Events can be a simple word in memory (allowing a process to poll them), but the event engine allows a number of more sophisticated operations to be performed. A main processor interrupt can be generated (allowing a process to poll an event for some period of time and then sleep in the device driver) or a copy can occur. Event copies are of particular interest, as they can be used to initiate further DMAs.

The Elan adapters connect each node to a multi-stage switch network constructed from Elite switches. Each switch is an 8×8 bi-directional full crossbar with 2 virtual channels per link. Networks are constructed in a fat tree topology, with 4 links down to the nodes and 4 links up from each switch.

Figure 2: 128-way QsNetII switch

All 8 links can connect down at the top stage, allowing us to build networks with 4^(n-1)×8 ports from n stages, as shown in Table 1.

Stages  Switches     Ports
1       8            8
2       4×8          32
3       4×4×8        128
4       4×4×4×8      512
5       4×4×4×4×8    2048

Table 1: QsNetII Switch Topologies

Switches are packaged in a variety of configurations. Standalone 8-, 32- and 128-way switch modules provide low cost entry-level systems for small and medium size clusters. Larger networks are constructed using our federated switch architecture. For example, a 2048-port network (as shown in Figure 3) is constructed by connecting 32 64-way node switches with 64 32-way top switches.

Figure 3: QsNetII 2048 port Network

This network maintains full bi-section bandwidth. In fact the bi-section bandwidth of the network exceeds the host adapter bandwidth by a factor of 2, since the links are bi-directional. Reduced bandwidth federated networks are available at reduced cost.

QsNetII networks provide many routes between any pair of nodes. The cross point switches adaptively route packets up the tree, selecting lightly loaded error-free links at each stage.

The Elite4 switch used to construct QsNetII networks supports broadcast operations in hardware. A DMA packet input on one link can be sent on to a range of output links. Broadcast packets are routed up the broadcast tree (any one of the many trees in the QsNetII network) to a point high enough that all destination nodes can be reached, then down to all of the nodes in the range. Acknowledgements are combined back up this tree and a single success or failure token is returned to the source. This mechanism allows data to be sent to all nodes in much the same time as it can be sent to any one.

The network's ability to combine acknowledgements is also used to implement network conditionals. A process can test the value of a memory location across a range of nodes, returning true if it is true in all of them and false if it is false in any of them.

The Quadrics software stack includes both MPI and Shmem [6] interfaces.
These libraries are implemented with libelan, which provides inter-process communication primitives, and libelan4, the device specific command issue library. The interface between MPI and libelan is device independent, allowing the same MPI library to be used for both Elan4 adapters and the older Elan3 adapters installed in many AlphaServer SC and Linux clusters. Dynamic libraries are used throughout, allowing the same user binary to run on different systems of either generation.

A number of basic point-to-point operations are used in the implementation of our all-to-all collective. Of particular interest are the elan_put() functions.

    void *elan_put(ELAN_STATE *state, void *source, void *dest,
                   size_t size, u_int destvp);

    void *elan_doput(ELAN_PGCTRL *pgctrl, void *source, void *dest,
                     ELAN_ADDR devent, size_t size, u_int destvp, int rail);

Both functions transfer size bytes of data from the source process to the destination, identified by destvp (also referred to as the rank or processing element). A broadcast is performed using a pre-assigned broadcast virtual process id. An opaque pointer is returned; it can be passed to elan_wait() to wait on or test for the completion of the transfer. The STEN is used if the transfer size is below a programmed threshold, otherwise a DMA is issued. The elan_put() function will stripe large messages over multiple rails where available. The elan_doput() function sets a destination event on completion. Many put operations (thousands) can be outstanding at any point in time.

The Elan library provides global allocators for main memory and adapter SDRAM. The destination event passed to elan_doput() will typically have been allocated in advance in adapter SDRAM using such an allocator. For details of the Elan library [7] see www.quadrics.com/documentation.
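
To illustrate how a regular all-to-all decomposes into put operations like those described above, the following is a minimal single-process sketch in C. It is not the Quadrics implementation and does not call libelan: sim_put() and sim_alltoall() are hypothetical names introduced here, memcpy() stands in for the remote transfer, and all per-process send buffers (and likewise all receive buffers) are assumed to be packed into one array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a one-sided put: copies `size` bytes from
 * `source` to `dest`. In this simulation the "remote" memory is simply
 * another region of the same array, so no network is involved. */
void sim_put(const void *source, void *dest, size_t size)
{
    memcpy(dest, source, size);
}

/* Simulated regular all-to-all over p processes (assumed layout):
 *   send + src * p * block  is the send buffer of process `src`,
 *   recv + dst * p * block  is the receive buffer of process `dst`,
 * and each buffer holds p blocks of `block` bytes.
 * Block `dst` of process `src`'s send buffer is put into block `src`
 * of process `dst`'s receive buffer. The case src == dst is the
 * message a process sends to itself. */
void sim_alltoall(size_t p, size_t block,
                  const unsigned char *send, unsigned char *recv)
{
    for (size_t src = 0; src < p; src++) {
        for (size_t dst = 0; dst < p; dst++) {
            sim_put(send + (src * p + dst) * block,
                    recv + (dst * p + src) * block,
                    block);
        }
    }
}
```

For p = 3 and a block size of 2 bytes, send and recv each hold 3×3×2 = 18 bytes. After the call, block s of process d's receive buffer equals block d of process s's send buffer, which is the block transpose behind the matrix transpose use mentioned in the introduction. On a real system each process would issue only its own p puts and then wait for their completion, for example with elan_wait().
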

MPI All-to-All Functions

MPI provides an application interface to the regular all-to-all operation (in which all elements are the same size) and the irregular operation (in which the size of each element can vary).

    int MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                     void* recvbuf, int recvcount, MPI_Datatype recvtype,
                     MPI_Comm comm);

The communicator defines the set of processes involved in the collective; this can include all pr

[1] Jehoshua Bruck, et al. Efficient algorithms for all-to-all communications in multi-port message-passing systems, 1994, SPAA '94.

[2] Fabrizio Petrini, et al. Hardware- and software-based collective communication on the Quadrics network, 2001, Proceedings IEEE International Symposium on Network Computing and Applications, NCA 2001.

[3] Sayantan Sur, et al. Efficient and scalable all-to-all personalized exchange for InfiniBand-based clusters, 2004, International Conference on Parallel Processing, ICPP 2004.