Minimum multiplicative complexity implementation of the 2D DCT using Xilinx FPGAs

This paper investigates two options for the field-programmable gate array (FPGA) implementation of a very high-performance 2D discrete cosine transform (DCT) processor for real-time applications. The first architecture exploits the separability of the transform and uses a row-column decomposition, with the row and column processors realized using distributed arithmetic (DA) techniques. The second approach uses an inherently 2D method based on polynomial transforms. The paper gives an overview of DCT calculation using DA methods and describes the FPGA implementation. It then presents a tutorial overview of a computationally efficient method for computing 2D DCTs using polynomial transforms, followed by a detailed analysis of the datapath for this approach on an 8 × 8 data set. Comparisons show that, for equal throughput, the polynomial transform approach requires 67% of the logic resources of the DA processor. The polynomial transform approach is also shown to scale better than the DA approach as the block size increases.
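
The separability that the first architecture relies on can be illustrated with a short software sketch. It is not the FPGA design described in the paper: it assumes floating-point arithmetic, an orthonormal DCT-II scaling, and direct evaluation of the cosine sums, and the function names (`dct_1d`, `dct_2d_row_column`, `dct_2d_direct`) are hypothetical. It only shows that a 1D DCT applied to every row, followed by a 1D DCT applied to every column, gives the same result as the direct 2D definition.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence, by direct O(N^2) evaluation."""
    n = len(x)
    out = []
    for k in range(n):
        # Assumed scaling: sqrt(1/N) for k = 0, sqrt(2/N) otherwise.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * acc)
    return out


def transpose(m):
    """Transpose a square or rectangular list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dct_2d_row_column(block):
    """2D DCT by row-column decomposition: rows first, then columns."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)


def dct_2d_direct(block):
    """2D DCT of a square block straight from the double-sum definition."""
    n = len(block)

    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for i in range(n):
                for j in range(n):
                    acc += (
                        block[i][j]
                        * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                        * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                    )
            out[u][v] = c(u) * c(v) * acc
    return out
```

For an N × N block, the row-column form needs 2N one-dimensional transforms of length N, whereas the direct double sum costs on the order of N^4 multiplications. That saving is what makes the decomposition attractive as a baseline against which an inherently 2D method can be compared.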