Cache Balancer: A communication latency and utilization aware resource manager

The growing number of processors in today's many-core architectures has led to new memory management challenges. The performance of many-core processors is often limited by the communication latency incurred when data is transferred between cores. Conventional memory allocators do not account for these communication costs when allocating memory to application tasks at runtime. Several existing proposals address this issue, but they lead to non-uniform utilization of the available system resources. This work introduces Cache Balancer, a dynamic memory allocation technique that addresses these limitations of state-of-the-art schemes. Cache Balancer introduces the access rate metric to measure the utilization of the individual cache banks in the system, and uses it at runtime to decide where memory is allocated. By avoiding allocation in over-utilized cache banks, the technique reduces memory access latency by up to 63.4%. Cache Balancer also incorporates a runtime task mapper that uses information on the execution characteristics of tasks and the structure of the system interconnect to determine a mapping that results in optimal memory throughput. This yields additional memory access latency reductions of up to 14.5%, and combined execution time improvements of up to 22% compared to state-of-the-art schemes.
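The allocation idea can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define the access rate metric or the placement policy in detail. The sketch assumes that access rate is the number of accesses per cycle in a sampling window, that banks sit on a 2D mesh with Manhattan hop distance, and that a bank counts as over-utilized above a fixed threshold. The names `Bank`, `access_rate`, `hops`, `choose_bank`, and `max_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    """A cache bank at mesh position (x, y). Hypothetical model."""
    x: int
    y: int
    accesses: int = 0  # accesses observed in the current sampling window


def access_rate(bank: Bank, window_cycles: int) -> float:
    """Assumed metric: accesses per cycle over the sampling window."""
    return bank.accesses / window_cycles


def hops(core: tuple, bank: Bank) -> int:
    """Manhattan distance on the mesh, used as a proxy for latency."""
    return abs(core[0] - bank.x) + abs(core[1] - bank.y)


def choose_bank(core: tuple, banks: list, window_cycles: int,
                max_rate: float) -> Bank:
    """Pick the nearest bank whose access rate is at most max_rate."""
    candidates = [b for b in banks
                  if access_rate(b, window_cycles) <= max_rate]
    if not candidates:
        # Every bank is over the threshold: fall back to all banks.
        candidates = banks
    # Prefer fewer hops; break ties by lower access rate.
    return min(candidates,
               key=lambda b: (hops(core, b), access_rate(b, window_cycles)))


banks = [Bank(0, 0, accesses=900),
         Bank(1, 0, accesses=100),
         Bank(3, 3, accesses=10)]
chosen = choose_bank((0, 0), banks, window_cycles=1000, max_rate=0.5)
print(chosen)
```

In this example the bank local to the core at (0, 0) has an access rate of 0.9 and is skipped, so the allocation goes to the neighbouring bank at (1, 0) rather than to the distant, lightly loaded bank at (3, 3).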
