Adaptive In-Cache Streaming for Efficient Data Management

The design of adaptive architectures frequently focuses solely on adapting the processing blocks, often neglecting the power and performance impact of data transfers and data indexing in the memory subsystem. In particular, conventional address-based models, which rely on cache structures to mitigate the memory wall problem, often struggle with memory-bound applications or with arbitrarily complex data patterns that can hardly be captured by prefetching mechanisms. Stream-based techniques have proven to tackle such limitations efficiently, although they are not well suited to all types of applications. To mitigate the limitations of both communication paradigms, an efficient unification is herein proposed by means of a novel in-cache stream paradigm, capable of seamlessly adapting the communication between the address-based and stream-based models. The proposed morphable infrastructure relies on a new dynamic descriptor graph specification, capable of handling regular, arbitrarily complex data patterns, which improves main memory bandwidth utilization through data reuse and reorganization techniques. When compared with state-of-the-art solutions, the proposed structure offers higher address generation efficiency and higher achievable memory throughput, together with a significant reduction in the amount of data transfers and main memory accesses, resulting on average in a 13-fold system performance speedup and a 245-fold improvement in energy-delay product.
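As a rough illustration of what descriptor-driven address generation for regular patterns can look like, the sketch below expands a nested list of (stride, count) descriptors into a sequence of word addresses. It is a simplified assumption made for illustration, not the descriptor graph specification proposed in the paper: the `Descriptor` class, the `generate_addresses` function, and the nesting scheme are all hypothetical names and choices.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Descriptor:
    """One hypothetical pattern level: `count` steps, `stride` words apart."""
    stride: int
    count: int


def generate_addresses(base: int, levels: List[Descriptor]) -> Iterator[int]:
    """Yield word addresses for nested levels, outermost level first."""
    if not levels:
        # No levels left: the accumulated base is the address itself.
        yield base
        return
    outer, inner = levels[0], levels[1:]
    for i in range(outer.count):
        # Each outer step shifts the base seen by the inner levels.
        yield from generate_addresses(base + i * outer.stride, inner)


# A 2x3 tile of a row-major matrix with 8 words per row, starting at word 10.
tile = list(generate_addresses(
    10, [Descriptor(stride=8, count=2), Descriptor(stride=1, count=3)]))
print(tile)  # [10, 11, 12, 18, 19, 20]
```

In this sketch a single descriptor covers a linear or strided stream, while nesting two descriptors covers a tiled access; more elaborate regular patterns would need additional levels or a richer structure than this flat list.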
