Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays

The computational capability of a coarse-grained reconfigurable array (CGRA) can be significantly restrained by data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to address the context bottleneck. The first loads the context into the CGRA at run time. It occupies very little on-chip memory but incurs very large latency, which leads to low computational efficiency. The second adopts a multi-context structure: the context is loaded into the on-chip context memory at the boot phase, and broadcasting a pointer to a set of contexts changes the hardware configuration on a cycle-by-cycle basis. In multi-context structures, however, the context memory induces a large area overhead, and its size places major restrictions on application complexity. This paper proposes a Predictable Context Cache (PCC) architecture that addresses these context issues by buffering the context inside the CGRA. In this architecture, context is dynamically transferred into the CGRA. Utilizing a PCC significantly reduces the on-chip context memory, and the complexity of the applications running on the CGRA is no longer restricted by the size of that memory. For the data bandwidth issue, data preloading is the most frequently used approach to hide input data latency and speed up data transmission. Preloading does not reduce the amount of input data; instead, it overlaps data transfer with computation. As the scale of the reconfigurable array increases, however, data transmission becomes the critical path and preloading can no longer work efficiently. This paper therefore also presents a Hierarchical Data Memory (HDM) architecture as a solution to this efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. The HDM architecture relieves the external memory of the data transfer burden, so performance is significantly improved. With PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%–19.48% with a reasonable memory size. As a result, 1080p@35.7fps H.264 High Profile video decoding can be achieved on the PCC and HDM architecture at a 200 MHz working frequency. Further, complex applications were no longer restricted by the size of the on-chip context memory and were executed efficiently on the PCC and HDM architecture.
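To put the reported throughput in perspective, the following back-of-the-envelope sketch converts 1080p@35.7fps at 200 MHz into a cycle budget per frame and per macroblock. It is not taken from the paper: the function name `cycle_budget` is hypothetical, and the calculation assumes the standard H.264 macroblock size of 16×16 luma samples, with the 1080-line height padded up to a whole number of macroblock rows.

```python
import math


def cycle_budget(clock_hz, fps, width=1920, height=1080, mb_size=16):
    """Return (macroblocks per frame, cycles per frame, cycles per macroblock).

    Assumption (not stated in the abstract): frames are split into
    mb_size x mb_size macroblocks and partial rows/columns are padded up.
    """
    mbs_per_row = math.ceil(width / mb_size)   # 1920 / 16 = 120
    mb_rows = math.ceil(height / mb_size)      # 1080 / 16 = 67.5, padded to 68
    mbs_per_frame = mbs_per_row * mb_rows
    cycles_per_frame = clock_hz / fps
    cycles_per_mb = cycles_per_frame / mbs_per_frame
    return mbs_per_frame, cycles_per_frame, cycles_per_mb


# Figures quoted in the abstract: 200 MHz clock, 35.7 frames per second.
mbs, per_frame, per_mb = cycle_budget(200e6, 35.7)
print(f"macroblocks per frame: {mbs}")
print(f"cycles per frame:      {per_frame:,.0f}")
print(f"cycles per macroblock: {per_mb:.1f}")
```

Under these assumptions the budget comes to 8160 macroblocks per frame and roughly 5.6 million cycles per frame, or a little under 700 cycles per macroblock on average. All context loading and data movement must fit inside that budget alongside the computation, which is why the abstract treats memory bandwidth as the limiting factor.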
