Communication and fault tolerance in parallel computers

This thesis explores two fundamental issues in the design of large-scale parallel computers: communication and fault tolerance. In Chapter 1, we introduce and provide motivation for the problems we study in this thesis. Chapter 2 examines several simple algorithms for routing packets on butterfly networks with bounded queues. Among other things, we show that for any greedy queuing protocol, a routing problem in which each of the $N$ inputs sends a packet to a randomly chosen output requires $O(\log N)$ steps, with high probability, provided that the queue size is a sufficiently large, but fixed, constant. In Chapter 3, we analyze the fault-tolerance properties of several bounded-degree hypercubic networks that are commonly used for parallel computation. Among other things, we show that an $N$-node butterfly containing $N^{1-\epsilon}$ worst-case faults (for any constant $\epsilon > 0$) can emulate a fault-free butterfly of the same size with only constant slowdown. Similar results are proved for the shuffle-exchange graph. Hence, these networks become the first connected bounded-degree networks known to be able to sustain more than a constant number of worst-case faults without suffering more than a constant-factor slowdown in performance. In Chapter 4, we study the ability of array-based networks to tolerate faults. Among other things, we show that an $N \times N$ two-dimensional array can sustain $N^{1-\epsilon}$ worst-case faults, for some fixed $\epsilon < 1$, and still emulate a fully-functioning $N \times N$ array with only constant slowdown. In Chapter 5, we study a concurrent error detection scheme called Algorithm Based Fault Tolerance (ABFT). Unlike the schemes developed in Chapters 3 and 4 to tolerate permanent faults, the scheme studied in this chapter is primarily aimed at tolerating transient faults in a parallel computer. The main contribution of this chapter is to propose a simple and novel algorithm called RANDGEN to generate data-check relationships. By simply varying its parameters, RANDGEN can produce data-check relationships with a wide spectrum of properties, many of which have been considered important in recent ABFT designs.
