Abacus - a reconfigurable bit-parallel architecture for early vision

Many important computational problems, including those of computer vision, are characterized by data-parallel, low-precision integer operations on large volumes of data. For such highly structured problems, this thesis develops Abacus, a high-speed reconfigurable SIMD (single-instruction, multiple-data) architecture that outperforms conventional microprocessors by over an order of magnitude using the same silicon resources. Earlier SIMD systems computed at relatively slow clock rates compared to their uniprocessor counterparts. The thesis discusses the problems involved in operating a large SIMD system at high clock rates, including instruction distribution and chip-to-chip communication, and presents the solutions adopted in the Abacus design. Although the chip was implemented in a 1989-era VLSI technology, it was designed to contain 1024 processing elements (PEs), operate at 125 MHz, and deliver 2 billion 16-bit arithmetic operations per second (2 GOPS). The PE and chip architecture are described in detail, along with the results of testing the chip at 100 MHz. Despite this high performance, the Abacus one-bit ALU is not the optimal point in the design space. An analytical model is developed for performance as a function of ALU width and off-chip memory bandwidth. Intuition provided by the model leads to the conclusion that an eight-bit ALU is an optimal choice for the current technology. Finally, using the analytical model, area and time parameters from the Abacus chip, and some lessons learned from the chip implementation, a design is presented for a 320 GOPS low-cost single-board system.
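As a rough sanity check on the peak-throughput figures quoted above, the sketch below works through the arithmetic implied by the abstract (1024 bit-serial PEs at 125 MHz delivering 2 GOPS of 16-bit operations). The 64-cycle cost per 16-bit operation is an inference from those numbers, not a figure stated in the thesis.

```python
# Illustrative arithmetic only; the 64-cycle cost of a 16-bit operation is
# inferred from the abstract's figures (1024 PEs, 125 MHz, 2 GOPS), not taken
# from the thesis itself.

pes = 1024                  # processing elements per chip
clock_hz = 125e6            # target clock rate

bit_ops_per_sec = pes * clock_hz          # 1.28e11 one-bit ALU operations/s

cycles_per_16bit_op = 64                  # assumed bit-serial cost of one 16-bit op
gops_16bit = bit_ops_per_sec / cycles_per_16bit_op / 1e9

print(f"Peak 16-bit throughput: {gops_16bit:.1f} GOPS")  # -> 2.0 GOPS
```

Under the same accounting, the proposed 320 GOPS single-board system corresponds to a two-orders-of-magnitude increase in aggregate bit-operation throughput over a single Abacus chip.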
