VLSI implementation of a 16*16 discrete cosine transform

The implementation of a 16*16 discrete cosine transform (DCT) chip using a concurrent architecture is presented. The chip contains 32 processing elements working in parallel and a random-access memory (RAM) which performs a 16*16 matrix transposition. The structure is highly regular and modular, and thus very efficient for VLSI implementation. The chip was designed for real-time processing of 14.3-MHz sample video data. It performs an equivalent of a half billion multiplications and accumulations per second. Fabricated in 2- mu m double-metal CMOS technology, the chip contains approximately 73000 transistors which occupy a 7.2*7.0-mm/sup 2/ area. The 68-pad die size is 8.3*8.1 mm/sup 2/. It is fully functional and is the first working 16*16 DCT chip. The architecture and accuracy studies for finite-wordlength processing are presented. The circuit design and layout using the symbolic design tool MULGA are described in detail. Possible variations are also discussed for multipurpose (variable transform sizes, forward-inverse transform) applications. >