Improved methods for hiding latency in high bandwidth networks (extended abstract)

In this paper we describe methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks. Our approach is similar in spirit to the “complementary slackness” method of latency hiding, but has the advantage that the slackness does not need to be provided by the programmer, and that large slowdowns are not needed in order to hide the latency. Our approach is also similar in spirit to the latency hiding methods of [~], but is not restricted to memoryless dataflow types of programs. Most of our analysis is centered on the simulation of unit-delay rings on networks of workstations (NOWs) with arbitrary delays on the links. For example, given any collection of operations (including updates of large local memories or databases) that runs in $t$ steps on a ring of $n$ workstations with unit link delays, we show how to perform the same collection of operations in $O(t \log^3 n)$ steps on any connected, bounded-degree network of $n/\log^3 n$ workstations for which the average link delay is constant. (Here we assume that the bandwidth available on the NOW links is $O(\log n)$ times the bandwidth available on the ring links. An extra factor of $\log n$ appears in the slowdown without this assumption.) The result makes non-trivial use of redundant computation, which is required to avoid a slowdown that is proportional to the maximum link delay. The increase in memory and computational load on each workstation needed for the redundant computation is at most $O(1)$. When the average latency in the network of workstations ($d_{ave}$) is not constant, the slowdown needed for the simulation degrades by an additional factor of $O(\sqrt{d_{ave}})$. This is still far superior to a slowdown of $\Theta(d_{\max})$, which can occur without redundant computation. As a consequence of our work on rings, we can also derive emulations of a wide variety of other unit-delay network architectures on a NOW with high-latency links.
For example, we show how to emulate an $N$-node 2-dimensional array with unit delays, using slowdown $s = O(\sqrt{N} \log^3 N + N^{1/4} \log^3 N \sqrt{d_{ave}})$ on any connected bounded-degree network of $O(N/s)$ workstations with average link delay $d_{ave}$. The emulation is work-preserving and the slowdown is close to optimal for many configurations of the network of workstations. We also prove lower bounds that establish limits on the degree to which the high latency links can be mitigated. These bounds demonstrate that it is easier to overcome latencies in dataflow types of computations than in computations that require access to large local databases.

* Department of Mathematics and Laboratory for Computer Science, MIT. Supported by NSF contract 9302476-CCR, ARMY grant DAAH04-95-1-0607 and ARPA contract N00014-95-1-1246. Email: andrews@math.mit.edu

† Department of Mathematics and Laboratory for Computer Science, MIT. Supported by ARMY grant DAAH04-95-1-0607 and ARPA contract N00014-95-1-1246. Email: ftl@math.mit.edu

‡ Department of Computer Science, Wellesley College. Supported by NSF contract 9504421-CCR, ARMY grant DAAH04-95-1-0607 and ARPA contract N00014-95-1-1246. Email: pmetaxas@wellesley.edu

§ Department of Mathematics and Laboratory for Computer Science, MIT. Supported by an NSF graduate fellowship, ARMY grant DAAH04-95-1-0607 and ARPA contract N00014-95-1-1246. Email: ~lu@math.mit.edu

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. SPAA '96, Padua, Italy. ©1996 ACM 0-89791-809-6/96/06 ...$3.50
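As a rough illustration of the ring result stated in the abstract (a computation of t steps on n unit-delay ring nodes, simulated in O(t log^3 n) steps on n / log^3 n workstations), the following sketch evaluates both bounds and compares total work. It is only arithmetic under stated assumptions: every hidden constant is set to 1, logarithms are taken base 2, and the function name `ring_simulation_bounds` is hypothetical, not something from the paper.

```python
import math


def ring_simulation_bounds(n: int, t: int) -> dict:
    """Evaluate the abstract's ring bounds with all hidden constants set to 1.

    Assumptions (not from the paper): log base 2, constants of 1,
    and constant average link delay on the host network.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that log n is positive")
    polylog = math.log2(n) ** 3                # the log^3 n factor
    hosts = max(1, math.floor(n / polylog))    # n / log^3 n workstations
    steps = t * polylog                        # t log^3 n simulation steps
    return {
        "hosts": hosts,
        "steps": steps,
        "guest_work": n * t,                   # ring nodes x ring steps
        "host_work": hosts * steps,            # workstations x simulation steps
    }


if __name__ == "__main__":
    print(ring_simulation_bounds(2 ** 16, 1000))
```

With n = 2^16 and t = 1000, this gives 16 workstations and equal guest and host work (n t in both cases), which is consistent with the slowdown of log^3 n being matched by a log^3 n reduction in the number of machines.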

[1]  W. Daniel Hillis,et al.  The network architecture of the Connection Machine CM-5 (extended abstract) , 1992, SPAA '92.

[2]  L. W. Tucker,et al.  Architecture and applications of the Connection Machine , 1988, Computer.

[3]  Sivan Toledo,et al.  Efficient Out-of-Core Algorithms for Linear Relaxation Using Blocking Covers , 1997, J. Comput. Syst. Sci..

[4]  W. F. McColl,et al.  Bulk synchronous parallel computing , 1995 .

[5]  W. Daniel Hillis,et al.  The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput..

[6]  Richard Cole,et al.  Multi-scale self-simulation: a technique for reconfiguring arrays with faults , 1993, STOC '93.

[7]  Arnold L. Rosenberg,et al.  Work-preserving emulations of fixed-connection networks , 1989, STOC '89.

[8]  Frank Thomson Leighton,et al.  Automatic methods for hiding latency in high bandwidth networks (extended abstract) , 1996, STOC '96.

[9]  Guy E. Blelloch,et al.  Implementation of a portable nested data-parallel language , 1993, PPOPP '93.

[10]  Sivan Toledo,et al.  Efficient out-of-core algorithms for linear relaxation using blocking covers , 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[11]  Burton J. Smith.  Architecture and Applications of the HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[12]  Yonatan Aumann,et al.  Computing with faulty arrays , 1992, STOC '92.

[13]  F. Leighton,et al.  Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes , 1991 .

[14]  Anna R. Karlin,et al.  Asymptotically tight bounds for computing with faulty arrays of processors , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[15]  Bruce M. Maggs,et al.  On the fault tolerance of some popular bounded-degree networks , 1992, Proceedings., 33rd Annual Symposium on Foundations of Computer Science.

[16]  Michael O. Rabin,et al.  Efficient dispersal of information for security, load balancing, and fault tolerance , 1989, JACM.

[17]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.