Dual-Data Rate Transpose-Memory Architecture Improves the Performance, Power and Area of Signal-Processing Systems

This paper presents a novel high-speed, area-efficient register-based transpose memory architecture enabled by operating on both edges of the clock. By using double-edge-triggered registers, the proposed architecture doubles the throughput and raises the maximum frequency by eliminating some of the combinational logic used in prior work. The design is evaluated with both FPGA and ASIC flows in 28/32 nm technology. Experimental results show that, compared to prior work, the FPGA implementation achieves an almost 4× improvement in throughput while consuming 46% less area. Compared to other register-based designs implemented with the same flow, the ASIC implementation achieves more than 60% area reduction and at least a 2× performance improvement while consuming 60% less power. As an example, the proposed 8×8 transpose memory with 12-bit input/output resolution achieves a throughput of 107.83 Gbps at 647 MHz using only 140 slices on a Xilinx Virtex-7 FPGA, and a throughput of 88.2 Gbps at 529 MHz in 0.024 mm² of silicon area as an ASIC. The proposed transpose memory is also integrated into 2D-DCT and 2D-IDCT blocks for signal-processing applications on the same FPGA platform. Compared to previous work, the new architecture gives the 2D-DCT a 3.5× speed-up while consuming 28% less area, and the 2D-IDCT a 3× speed-up while consuming 20% less area.
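To illustrate where a transpose memory sits in a 2D-DCT datapath, the following is a minimal behavioral sketch in Python of the standard row-column decomposition: a 1D DCT over the rows, a transpose, a second 1D pass, and a transpose back. It is not the paper's hardware design and does not model the double-edge-triggered registers or any timing. The function names (`dct_1d`, `transpose`, `dct_2d`) and the orthonormal DCT-II scaling are assumptions chosen for illustration.

```python
import math

N = 8  # block size; the abstract's example is an 8x8 transpose memory


def dct_1d(v):
    """Orthonormal DCT-II of one vector (direct O(n^2) form, illustrative only)."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def transpose(block):
    """Software stand-in for the transpose memory: rows go in, columns come out."""
    return [list(col) for col in zip(*block)]


def dct_2d(block):
    """Row-column 2D DCT: 1D pass on rows, transpose, 1D pass, transpose back."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(row) for row in transpose(rows_done)]
    return transpose(cols_done)


if __name__ == "__main__":
    flat = [[1.0] * N for _ in range(N)]
    coeffs = dct_2d(flat)
    print(round(coeffs[0][0], 6))  # DC coefficient of a constant block
```

In this decomposition the second 1D pass cannot start on a column until the transposed data are available, which is why the throughput of the transpose stage bounds the throughput of the whole 2D transform.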
