CIACP: A Correlation- and Iteration-Aware Cache Partitioning Mechanism to Improve Performance of Multiple Coarse-Grained Reconfigurable Arrays

Multiple coarse-grained reconfigurable arrays (CGRAs), organized in parallel or as a pipeline to execute applications, have become a productive solution for balancing performance with flexibility. One key to obtaining high performance from multiple CGRAs is to manage the shared on-chip cache efficiently so as to reduce off-chip memory bandwidth requirements. Cache partitioning is regarded as a promising technique for improving the efficiency of a shared cache. However, most prior partitioning techniques were developed for multi-core platforms and target multi-programmed workloads. They cannot directly address the adverse effects of data correlation and computation imbalance among competing CGRAs on a multi-CGRA platform. This paper proposes a correlation- and iteration-aware cache partitioning (CIACP) mechanism for partitioning the shared cache in multi-CGRA systems. The mechanism employs correlation monitors (CMONs) to trace the amount of overlapping data among parallel CGRAs, and iteration monitors (IMONs) to track the computation load of each CGRA. Using the information collected by the CMONs and IMONs, CIACP can eliminate redundant cache utilization by overlapping data and can shorten the total execution time of pipelined CGRAs. Experimental results show that CIACP outperforms state-of-the-art utility-based cache partitioning techniques by up to 16 percent in performance.
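
The iteration-aware side of this idea can be illustrated with a minimal sketch. It is not the CIACP algorithm itself, which the abstract does not specify; it is a generic greedy, utility-style way allocation in which each CGRA's miss reduction is weighted by its iteration count. The function name `partition_ways`, the per-CGRA miss curves, and the iteration counts are all hypothetical inputs standing in for what monitors such as the IMONs might report, and the handling of overlapping data is omitted entirely.

```python
def partition_ways(miss_curves, iterations, total_ways):
    """Greedily hand out cache ways one at a time (illustrative only).

    miss_curves[i][w]: assumed misses per iteration of CGRA i when it
                       owns w ways (hypothetical monitor output).
    iterations[i]:     assumed iteration count of CGRA i (hypothetical).
    Returns a list with the number of ways assigned to each CGRA.
    """
    n = len(miss_curves)
    alloc = [0] * n
    for _ in range(total_ways):
        best, best_gain = None, -1.0
        for i in range(n):
            w = alloc[i]
            if w + 1 >= len(miss_curves[i]):
                continue  # the curve has no entry for one more way
            # Misses saved by one extra way, scaled by the CGRA's load.
            gain = iterations[i] * (miss_curves[i][w] - miss_curves[i][w + 1])
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # every curve is exhausted; leave remaining ways unused
        alloc[best] += 1
    return alloc


# Made-up miss curves for two CGRAs sharing a 4-way cache.
curves = [[100, 60, 40, 30, 25], [100, 90, 50, 45, 44]]
equal_load = partition_ways(curves, [1, 1], 4)
skewed_load = partition_ways(curves, [1, 4], 4)
```

With these made-up numbers, weighting by iteration count moves ways toward the more heavily loaded CGRA, which is the kind of imbalance a plain miss-count policy would not see.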
