Fault-Tolerant Computing on POEM

Abstract: Wafer scale integration (WSI) promises to realize a complete multiprocessing system on a single wafer, eliminating the expensive steps required to dice and bond. The underlying premise is that internal connections between chips on the same wafer are more reliable and have smaller propagation delays than external connections. However, achieving a high yield has proven to be a major challenge. Rather than aiming for 100% yield, the realistic solution is to identify the defective components on the wafer and replace them with spares. The design must therefore tolerate faults introduced during the manufacturing process. Moreover, faults also occur during system operation, whether from component failure, improper operation, or environmental factors. A means of detecting these unexpected faults and recovering from them is therefore necessary to minimize downtime and unavailability. Long, periodic system outages are a luxury that computers used in critical applications cannot afford. In this paper, we show that introducing optical interconnection techniques into a multiprocessor environment (e.g., the Programmable Optoelectronic Multiprocessor, POEM) enables efficient implementation of fault-tolerant techniques.