ABSTRACT

This paper presents the implementation of a system called Net* that provides a parallel programming environment running on a group of personal computers interconnected via a dedicated local area network. Net* is designed to run on hardware consisting entirely of inexpensive “off-the-shelf” components, including a number of popular networking technologies. Its implementation depends on a set of operating system modifications that reduce the overhead of network I/O, and on a lightweight protocol that uses these modifications to provide high-speed, reliable, in-order message streams on a local area network. Unlike other recent work, our implementation keeps protocol processing in the kernel in order to provide security and efficiency on today’s commercially available network adaptors. Performing I/O directly to and from buffers mapped into both user and kernel space eliminates copying and reduces the need for explicit system calls. Preliminary measurements indicate that low per-message latency is achievable.

INTRODUCTION

This paper presents the implementation of a system called Net* that provides a parallel programming environment consisting of the C* data-parallel language [4] running on a group of personal computers interconnected via a dedicated local area network. This implementation required four components:

1. a control program and associated daemon that allow users to dynamically configure a parallel machine from a collection of personal computers and to control the execution of programs on that configuration [1];
2. a communications library that is invoked from code generated by the C* compiler to perform all network I/O [6];
3. a series of modifications to the operating system to decrease the latency and increase the throughput of network I/O [5] [9];
4. a lightweight protocol that is used by the communications library and in turn uses the operating system modifications to provide a high-speed, reliable alternative to TCP/IP on a local area network [7] [9].

The equipment for this project consisted of a number of Intel Pentium 133 MHz processors, each with 32 megabytes of memory and a 2 gigabyte fast, wide SCSI disk. These systems ran the Linux operating system, version 1.2.13. Each system had at least two network interface cards: one attached to the local LAN for connectivity to the global Internet, and one attached to another LAN dedicated exclusively to the parallel programming application. Several common LAN technologies were used for this dedicated network, including 10 Mbps Ethernet, 100 Mbps (fast) Ethernet, and 100 Mbps VG-AnyLAN; development is also underway for 155 Mbps ATM. On machines equipped with more than one additional network interface, the user dynamically selects the appropriate one by symbolic name.

The main thrust of this paper is a discussion of points 3 and 4: the operating system modifications and the protocol design that takes advantage of them. Unlike several other recent projects with similar goals, we have addressed the reality of keeping protocol processing in the kernel in order to provide security and efficiency on today’s commercially available network adaptors. The key factors in Net* are a secure shared-buffer interface between the user and the kernel that avoids all kernel calls on input and permits some savings on bursty output, a low-cost timer design that efficiently controls message acknowledgement and retransmission, and a protocol designed to minimize the delay and the number of retransmissions after errors.
In the following sections we discuss our changes to the operating system, present our new protocol design, give performance results, compare our system with related work, and discuss ongoing development.

OPERATING SYSTEM MODIFICATIONS

In a recent paper, Druschel [3] presents a number of techniques that an operating system can use to provide better support for high-speed networking. These techniques are intended to eliminate three bottlenecks to performance: excess data copying, inappropriate scheduling of I/O, and required kernel intervention in all I/O. Our design addresses these as well as a fourth bottleneck identified by Jacobson and co-workers [2]: inappropriate timer facilities.

User API

We provide a system call that the user invokes once to initialize its connection to the local area network. In response to this call, the kernel allocates two pools of buffers, one for input and one for output. These buffers hold the network frames themselves and are allocated in kernel space so that any physical memory addressing restrictions imposed by a network interface card (NIC) can be satisfied. However, they are also mapped into the virtual address space of the user, so that the user can address them directly without kernel intervention. This mapping is done once for all buffers and remains in effect until the user process terminates its network connection. The size and number of buffers are specified by the user, subject to certain limitations.

As part of allocating the buffer pools, the kernel also creates, and maps into both kernel and user space, a set of queues and a set of pointers into the queues. There are two queues for each pool and two pointers for each queue. The queues, called the empty queue and the full queue, are implemented as fixed-size linear arrays containing one element for each buffer in the associated pool. Initially each slot in the empty queue holds a pointer to one of the buffers, and each slot in the full queue is set to zero. The two pointers into each queue, called the head and tail pointers, treat the queue array as a circular buffer.

Use of Shared Buffers

Once the buffer pools are initialized, interaction between the kernel and the user code fits the classic producer-consumer paradigm in each direction. The user process is the producer and the kernel is the consumer for outgoing packets; the roles are reversed for incoming packets. Since Net* dedicates the second network to a single parallel job, only one user process accesses the pools at any time. No locking mechanism is necessary to ensure mutual exclusion on the shared-buffer data structures, because of three facts: the kernel and the user process each read and write a distinct pointer when inserting or removing buffers in a queue; the queues are circular arrays containing exactly one slot for each buffer in the pool; and the queue slot referenced by a pointer contains 0 if the slot is empty and a non-zero buffer address otherwise.

For incoming data, the NIC generates an interrupt when a packet arrives, and the kernel responds by reading the packet into a buffer in the input pool. If the packet arrives before the user-level code is ready for it, it will be waiting in the input full queue when the user program next looks. If the user-level code tries to read a packet before it arrives, the user code can either poll the full queue until a packet shows up or wait for the packet by informing the kernel that it wishes to suspend execution until the packet arrives. A system call is therefore performed only in this latter case.
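To make the queue discipline concrete, the following sketch shows one way the shared circular queues and their lock-free use on the receive side could be written in C. It is an illustration under our own assumptions rather than the actual Net* code: the names netstar_queue, NS_NBUF, ns_enqueue, ns_dequeue, ns_recv, and ns_wait_for_input are hypothetical, and the sketch relies on the single-producer, single-consumer, single-processor setting described above, so no explicit memory barriers are shown.

```c
/*
 * Illustrative sketch only: a minimal single-producer/single-consumer
 * queue with the shape described in the text (fixed circular array, one
 * slot per buffer, zero = empty slot, head and tail advanced by
 * different parties).  All names here are hypothetical stand-ins for the
 * Net* data structures.
 */
#include <stddef.h>

#define NS_NBUF 64                     /* one slot per buffer in the pool   */

struct netstar_queue {
    void *volatile slot[NS_NBUF];      /* buffer address, or NULL if empty  */
    volatile unsigned head;            /* written only by the consumer      */
    volatile unsigned tail;            /* written only by the producer      */
};

/* Producer: insert a buffer into the next slot; -1 if the queue is full. */
int ns_enqueue(struct netstar_queue *q, void *buf)
{
    unsigned t = q->tail;
    if (q->slot[t] != NULL)            /* slot still occupied: queue full   */
        return -1;
    q->slot[t] = buf;                  /* non-zero address marks it full    */
    q->tail = (t + 1) % NS_NBUF;
    return 0;
}

/* Consumer: remove the next buffer, or return NULL if nothing is queued. */
void *ns_dequeue(struct netstar_queue *q)
{
    unsigned h = q->head;
    void *buf = q->slot[h];
    if (buf == NULL)                   /* zero means the slot is empty      */
        return NULL;
    q->slot[h] = NULL;                 /* free the slot for reuse           */
    q->head = (h + 1) % NS_NBUF;
    return buf;
}

void ns_wait_for_input(void);          /* hypothetical blocking system call */

/* User-side receive: poll the input full queue, optionally suspending
 * until a packet arrives; the caller returns the consumed buffer to the
 * input empty queue when it is finished with it.                          */
void *ns_recv(struct netstar_queue *in_full, int may_block)
{
    void *buf;
    while ((buf = ns_dequeue(in_full)) == NULL) {
        if (!may_block)
            return NULL;
        ns_wait_for_input();           /* system call only when blocking    */
    }
    return buf;
}
```

The absence of locking follows directly from this structure: each side advances only its own index, and a slot changes only from zero to a buffer address (producer) or from a buffer address back to zero (consumer), so the two sides never write the same word.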
For outgoing messages the user-level code simply places the data into a shared buffer, adds the buffer to the output full queue, and continues processing. At this point the user tests a flag associated with the full queue to see whether the kernel device driver for the associated NIC is busy with a previous transmission. If it is, nothing further needs to be done: when the NIC interrupts the kernel at the completion of the previous transmission, the kernel will find the new full buffer in the queue and transmit it. Only if the driver is quiescent does the user program have to issue a system call to force the kernel to look in the full queue for the next buffer to deliver to the NIC. Clearly, if the user process can produce output fast enough, no system call overhead is necessary. In many cases output is produced in bursts, so a system call is needed to wake up the kernel driver only for the first buffer in a burst.

An alternative to testing this “busy” flag and issuing a system call to “wake” the kernel driver would be to have the kernel periodically poll the full queue to see whether anything is waiting to be sent to the NIC. The kernel could poll the queue whenever an interrupt occurs, but the only interrupt guaranteed to occur on a regular basis is the clock tick, which arrives only once every 10,000 microseconds on a standard Pentium processor. The design tradeoff is therefore the overhead of issuing the system call versus the dead time lost between clock ticks. Since the difference is on the order of a factor of 1,000, polling only on clock ticks would often leave the network idle for unacceptably long periods of time.

This buffering scheme clearly eliminates virtually all unnecessary data copying, since the NIC and the user process read and write data at exactly the same memory locations. It is equally clear that we have not eliminated kernel intervention on any of the network I/O; we have simply eliminated the need for the user process to issue system calls on all input operations and on some output operations (those that occur “soon enough” after the previous output operation). In particular, we have not granted the user process direct access to the device registers of the NIC. We consider such direct access a potential security loophole, since most of today’s commercially available network adaptors are designed to permit access to either no functionality or full functionality; there is no way to grant selective access to just certain functions. Furthermore, without being able to program the NIC, kernel intervention to handle interrupts from the NIC is almost certainly required, especially to de
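Putting the output path described above together, a user-level send might look like the following sketch, which builds on the queue sketch shown earlier. Again the shared objects (ns_out_empty, ns_out_full, ns_driver_busy) and the wake-up system call (ns_kick_driver) are hypothetical names used only for illustration, not the real Net* interface.

```c
/*
 * Illustrative sketch of the user-side output path.  The buffer is filled
 * in place, preserving the zero-copy property, and a system call is made
 * only when the driver is idle, so a burst of output pays for one call.
 */
extern struct netstar_queue ns_out_empty;    /* free output buffers         */
extern struct netstar_queue ns_out_full;     /* buffers ready to transmit   */
extern volatile int ns_driver_busy;          /* set while the driver sends  */

void ns_kick_driver(void);                   /* hypothetical system call    */

/* Obtain a free shared buffer so the message can be built in place. */
void *ns_get_txbuf(void)
{
    return ns_dequeue(&ns_out_empty);        /* NULL if none is free        */
}

/* Commit a filled buffer for transmission. */
void ns_send_buf(void *buf)
{
    ns_enqueue(&ns_out_full, buf);           /* hand the frame to the kernel */
    if (!ns_driver_busy)                     /* driver quiescent: wake it up */
        ns_kick_driver();
    /* Otherwise the driver's transmit-complete interrupt will find the
     * new buffer in the full queue and send it with no further action.     */
}

/* Example use: build one frame directly in the shared buffer and send it. */
void ns_send_example(void)
{
    char *frame = ns_get_txbuf();
    if (frame != NULL) {
        frame[0] = 0x01;                     /* ... construct the frame ...  */
        ns_send_buf(frame);
    }
}
```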
REFERENCES

[1] P. Druschel et al. Operating system support for high-speed communication: Latest developments in operating systems, 1996.
[2] David Clark et al. An analysis of TCP processing overhead, 1989.
[3] Thorsten von Eicken et al. Low-Latency Communication over Fast Ethernet. Euro-Par, Vol. I, 1996.
[4] Thorsten von Eicken et al. U-Net: a user-level network interface for parallel and distributed computing. SOSP, 1995.
[5] Anthony J. Lapadula et al. A Retargetable C* Compiler and Run-time Library for Mesh-Connected MIMD Multicomputers, 1992.
[6] Peter Druschel et al. Operating system support for high-speed communication. CACM, 1996.
[7] Scott Pakin et al. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. Proceedings of the IEEE/ACM SC95 Conference, 1995.