Efficient Vertical/Horizontal-Space 1D-DCT Processing Based on Massive-Parallel Matrix-Processing Engine

This paper reports an efficient discrete cosine transform (DCT) processing for the JPEG algorithm using a massive-parallel memory-embedded SIMD matrix processor. The matrix-processing engine has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. For compatibility with this matrix-processing architecture, the conventional DCT algorithm has been improved in arithmetic order and the vertical/horizontal-space 1 dimensional (1D)-DCT processing has been further developed. Evaluation results of the matrix-engine-based DCT processing show that the necessary clock cycles per image blocks can be reduced by 87% in comparison to a conventional DSP architecture. The determined performances in MOPS and MOPS/mm are factors 8 and 5.6 better than with a conventional DSP, respectively. Moreover, the matrix-processing engine can reduce the number of total clock cycles for JPEG application about 49% in comparison to a conventional DSP architecture.