Paper information - Fault-tolerant bitonic sorting networks and static shuffle-exchange networks

As the complexity of parallel systems grows, so does the probability of failure. Transient, intermittent, and permanent faults may occur in processors and/or interconnection networks, so fault-tolerant interconnection networks are essential to the reliability of parallel computer systems. A fault-tolerant network can route information even if certain network components (processors, switches, and/or links) fail. A large-scale bitonic sorting network can serve as a flexible means of tying together the parts of a large-scale multiprocessing computer system because of its fast sorting and ordering capability. A static shuffle-exchange network is another good interconnection pattern for a large-scale (distributed) multiprocessing computer system because of its low diameter, low degree, and use of only two interconnection functions. However, both networks lack fault tolerance. A new fault detection algorithm and recovery technique for bitonic sorting networks is proposed. A single fault in the comparison elements or links can be detected and diagnosed by inserting $O(\log_2 N)$ sets of testing vectors. The basic testing vectors consist of subsets and combinations of ascending $(0,\ 1,\ 2,\ 3,\cdots)$ and descending $(\cdots, 3,\ 2,\ 1,\ 0)$ vectors. To recover from a fault, additional comparison elements and additional links for each comparison element are used. The Dynamic Bitonic Sorting network is based on the Dynamic Redundant network and can be constructed in many different variations. Multiple-fault-tolerant static shuffle-exchange (recirculating perfect shuffle) networks are presented. To recover from k faulty processing elements, a network needs at least 2k additional processing elements and at most 4k additional shuffle ports for each processing element: 2k for the shuffle-out ports and 2k for the shuffle-in ports.
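
As an illustration of the "only two interconnection functions" and the redundancy counts stated above, the following is a minimal Python sketch of a fault-free static shuffle-exchange network on $N = 2^n$ nodes. The function names (`shuffle`, `exchange`, `route`, `spare_resources`) are hypothetical and chosen for this example; the routing rule is the standard textbook one (n shuffles, each optionally followed by an exchange), not the fault-tolerant scheme of the paper.

```python
def shuffle(node, n_bits):
    """Perfect shuffle: rotate the n_bits-wide node address left by one bit."""
    mask = (1 << n_bits) - 1
    return ((node << 1) | (node >> (n_bits - 1))) & mask


def exchange(node):
    """Exchange: flip the least significant bit of the node address."""
    return node ^ 1


def route(src, dst, n_bits):
    """Return the list of nodes visited going from src to dst.

    Each of the n_bits rounds performs one shuffle and, if the low bit
    does not yet match the corresponding destination bit, one exchange.
    The path therefore has at most 2 * n_bits hops.
    """
    path = [src]
    cur = src
    for step in range(n_bits):
        cur = shuffle(cur, n_bits)
        path.append(cur)
        # Destination bits are fixed from most significant to least significant.
        wanted = (dst >> (n_bits - 1 - step)) & 1
        if (cur & 1) != wanted:
            cur = exchange(cur)
            path.append(cur)
    return path


def spare_resources(k):
    """Bounds quoted in the abstract for tolerating k faulty processing elements.

    Returns (minimum additional processing elements,
             maximum additional shuffle ports per processing element).
    """
    return 2 * k, 4 * k


# Example: route between two nodes of an 8-node network (n_bits = 3).
print(route(5, 2, 3))
print(spare_resources(2))
```

The hop bound of `2 * n_bits` is one way to see why the diameter of this network grows only logarithmically in the number of nodes.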
By decomposing the k-fault-tolerant static shuffle-exchange network into m identical modules, the reliability of the network can be increased. A distributed fault-diagnosis scheme for static shuffle-exchange networks is also presented: any single faulty node or edge in a static shuffle-exchange network can be detected and located within $O(\log_2 N)$ time.
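
The idea of probing a bitonic sorting network with ascending and descending test vectors can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the comparator layout is the standard bitonic sorting network, the fault model (a comparison element stuck in pass-through mode) is an assumption made for this example, and the helper names `bitonic_network`, `run_network`, and `detected_faults` are hypothetical. It does not reproduce the paper's $O(\log_2 N)$ test sets or its diagnosis procedure; it only checks which single faults become visible at the outputs for a given set of vectors.

```python
def bitonic_network(n_inputs):
    """Return the comparators of a bitonic sorting network in execution order.

    n_inputs must be a power of two. Each comparator is a tuple
    (a, b, ascending) with a < b; when ascending is True the smaller value
    ends up on line a, otherwise on line b.
    """
    comparators = []
    k = 2
    while k <= n_inputs:
        j = k // 2
        while j >= 1:
            for i in range(n_inputs):
                partner = i ^ j
                if partner > i:
                    comparators.append((i, partner, (i & k) == 0))
            j //= 2
        k *= 2
    return comparators


def run_network(comparators, values, faulty_index=None):
    """Apply the comparators to values and return the output list.

    Assumed fault model: the comparator at faulty_index never swaps
    (it passes both inputs straight through).
    """
    v = list(values)
    for index, (a, b, ascending) in enumerate(comparators):
        if index == faulty_index:
            continue
        if (v[a] > v[b]) == ascending:
            v[a], v[b] = v[b], v[a]
    return v


def detected_faults(n_inputs, test_vectors):
    """Return indices of comparators whose pass-through fault produces an
    unsorted output for at least one of the test vectors."""
    comparators = bitonic_network(n_inputs)
    detected = []
    for index in range(len(comparators)):
        if any(run_network(comparators, vec, index) != sorted(vec)
               for vec in test_vectors):
            detected.append(index)
    return detected


n = 8
ascending = list(range(n))
descending = ascending[::-1]
found = detected_faults(n, [ascending, descending])
print(f"{len(found)} of {len(bitonic_network(n))} single pass-through faults "
      "are visible at the outputs with these two vectors")
```

With only the two basic vectors, some faults can be masked by later comparators, which is consistent with the abstract's use of subsets and combinations of ascending and descending vectors rather than the two vectors alone.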

Kenneth E. Batcher | Hongin Choi