Memory-Access-Driven Context Partitioning for Window-Based Image Processing on Heterogeneous Multicore Processors

Accelerator cores in low-power heterogeneous processors have on-chip local memories to enable parallel data access. Because the capacities of these local memories are very small, data must be transferred from the global memory to the local memories many times, and these transfers greatly increase the total processing time. A memory allocation technique that increases data sharing is a good solution to this problem. When reconfigurable cores are used, however, the data must be shared among multiple contexts, and conventional context partitioning methods consider only how to reuse limited hardware resources in different time slots; they do not consider data sharing. This paper proposes a context partitioning method that shares both the hardware resources and the local memory data. According to the experimental results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.
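The effect of data sharing on transfer count can be illustrated with a toy model. The sketch below is not the proposed method; it assumes a 1-D row of pixels scanned by a sliding window with a stride of one pixel, and the function names are hypothetical. It only counts how many pixels would be copied from global to local memory with and without reuse of the overlapping pixels.

```python
def transfers_without_sharing(width, window):
    """Pixels loaded when every window position reloads its full window."""
    positions = width - window + 1  # number of window positions in the row
    return positions * window


def transfers_with_sharing(width, window):
    """Pixels loaded when overlapping pixels are kept in local memory."""
    positions = width - window + 1
    # The first position loads the whole window; each later position
    # (stride 1, an assumption of this sketch) needs only one new pixel.
    return window + (positions - 1)


if __name__ == "__main__":
    width, window = 16, 4
    print(transfers_without_sharing(width, window))
    print(transfers_with_sharing(width, window))
```

In this simplified setting every pixel is loaded exactly once when data is shared, so the transfer count equals the row width. If the shared pixels had to be reloaded at each context switch, that saving would shrink, which is the situation the abstract describes for reconfigurable cores.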
