Processing Element and Custom Chip Architecture for the BLITZEN Massively Parallel Processor

The BLITZEN PE, Rev. 1

clear the shift register.

One performance option is still open and will not be resolved until VLSI design is underway. On-chip propagation delays are expected to be very small, so the on-chip clock speed can be very high. The on-chip processing rate may exceed the rate at which signals can cross chip boundaries by a factor of two or four. The option is to design the chip with dual operation rates: one for activity internal to the chip and one for activity involving signals that cross chip boundaries. Instructions are sent from a control unit and must cross chip boundaries, so the idea is to use one instruction for multiple internal processing cycles. This would improve performance when instructions are repeated for multiple bit fields, as with arithmetic on multi-bit data items. This option is still under study.

The work reported in this paper resulted from the efforts of a group of researchers participating in this project at MCNC. In addition to the authors, these people included Fred Heaton (MCNC) and Jothy Rosenberg (Duke). We also benefitted from discussions with Kenneth Batcher of Loral Systems and Charles Fiduccia of General Electric.

The specifications of a processing element and a chip architecture for a massively parallel processor have been given. This report can be used as a requirements document for the BLITZEN PE VLSI design effort.

During the development of the BLITZEN PE architecture, various alternatives were considered. This section comments on some of the options.

One goal of this effort is to develop a system that is miniaturized compared with other commercially available systems. To reduce system size, it is desirable to eliminate the MPP staging memory. The staging memory buffers and reformats data; it does the corner turning necessary for bit-serial processing. The BLITZEN system must therefore still support corner turning.
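As an illustration of what corner turning means for the data itself, the following minimal Python sketch converts word-oriented data into bit planes and back. It is a software model of the reordering only, not a description of how the MPP staging memory or any BLITZEN hardware performs it. The function names, the word width, and the choice to store the least significant bit plane first are assumptions made for this example.

```python
def corner_turn(words, width):
    """Convert word-oriented data into bit planes.

    Plane b holds bit b of every word, so a bit-serial array can
    process one plane per step. LSB-first ordering is an assumption.
    """
    return [[(w >> b) & 1 for w in words] for b in range(width)]


def inverse_corner_turn(planes):
    """Rebuild the original words from a list of bit planes."""
    count = len(planes[0]) if planes else 0
    return [
        sum(planes[b][i] << b for b in range(len(planes)))
        for i in range(count)
    ]


# Four 3-bit words, one per (hypothetical) processing element.
words = [5, 3, 6, 1]
planes = corner_turn(words, width=3)
restored = inverse_corner_turn(planes)
```

Here `planes[0]` is the plane of least significant bits across all four words, and `restored` equals the original `words` list, which shows that the reordering loses no information.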
One method considered was to have two S planes running in orthogonal directions. This would allow movement of data in both the East-West and North-South directions, overlapped with other processing. In the final design an I/O bus was selected for the East-West movement. A modified North-South S plane, in which the S paths are limited to on-chip PEs, was considered as support for corner turning. It was decided that off-chip devices were better able to provide this function without occupying chip area. A related consideration concerned the need for a routing network …