Application-to-core mapping policies to reduce memory interference in multi-core systems

How applications running on a many-core system are mapped to cores largely determines the interference between these applications in critical shared resources. This paper proposes application-to-core mapping policies to improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major new ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that the former can make fast progress without heavy interference from the latter, 2) map those applications that benefit more from being closer to the memory controllers close to these resources. Our evaluations show that both ideas significantly improve system throughput, fairness and interconnect power efficiency.

[1]  Shashi Kumar,et al.  A two-step genetic algorithm for mapping task graphs to a network on chip architecture , 2003, Euromicro Symposium on Digital System Design, 2003. Proceedings..

[2]  Tong Li,et al.  Operating system support for overlapping-ISA heterogeneous multi-core architectures , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[3]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[4]  Natalie D. Enright Jerger,et al.  DBAR: An efficient routing algorithm to support multiple concurrent applications in networks-on-chip , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[5]  Chita R. Das,et al.  Aérgia: exploiting packet latency slack in on-chip networks , 2010, ISCA.

[6]  Onur Mutlu,et al.  A QoS-Enabled On-Die Interconnect Fabric for Kilo-Node Chips , 2012, IEEE Micro.

[7]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[8]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[9]  David W. Nellans,et al.  Handling the problems and opportunities posed by multiple on-chip memory controllers , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[10]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[11]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[12]  Tong Li,et al.  Using OS Observations to Improve Performance in Multicore Systems , 2008, IEEE Micro.

[13]  Krste Asanovic,et al.  Reducing power density through activity migration , 2003, ISLPED '03.

[14]  Onur Mutlu,et al.  Bottleneck identification and scheduling in multithreaded applications , 2012, ASPLOS XVII.

[15]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS 2010.

[16]  D.A. Wood,et al.  Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[17]  Moinuddin K. Qureshi Adaptive Spill-Receive for robust high-performance caching in CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[18]  Onur Mutlu,et al.  Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.

[19]  Sai Prashanth Muralidhara,et al.  Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[21]  Brian N. Bershad,et al.  Avoiding conflict misses dynamically in large direct-mapped caches , 1994, ASPLOS VI.

[22]  Chris Fallin,et al.  Next generation on-chip networks: what kind of congestion control do we need? , 2010, Hotnets-IX.

[23]  Srinivasan Murali,et al.  Bandwidth-constrained mapping of cores onto NoC architectures , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[24]  Reetuparna Das,et al.  Application-to-Core Mapping Policies to Reduce Interference in On-Chip Networks , 2011 .

[25]  Tajana Simunic,et al.  Evaluating the impact of job scheduling and power management on processor lifetime for chip multiprocessors , 2009, SIGMETRICS '09.

[26]  Mor Harchol-Balter,et al.  ATLAS : A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers , 2010 .

[27]  Radu Marculescu,et al.  Energy-aware mapping for tile-based NoC architectures under performance constraints , 2003, ASP-DAC '03.

[28]  Chita R. Das,et al.  Application-aware prioritization mechanisms for on-chip networks , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[29]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[30]  Anoop Gupta,et al.  Scheduling and page migration for multiprocessor compute servers , 1994, ASPLOS VI.

[31]  Onur Mutlu,et al.  Preemptive Virtual Clock: A flexible, efficient, and cost-effective QOS scheme for networks-on-chip , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[32]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[33]  Murali Annavaram,et al.  Mitigating Amdahl's Law through EPI Throttling , 2005, ISCA 2005.

[34]  Michael D. Smith,et al.  Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[35]  Onur Mutlu,et al.  Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance , 2006, IEEE Micro.

[36]  Faye A. Briggs,et al.  A study of performance impact of memory controller features in multi-processor server environment , 2004, WMPI '04.

[37]  O. Mutlu,et al.  Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.

[38]  Onur Mutlu,et al.  Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[39]  Song Jiang,et al.  CLOCK-Pro: An Effective Improvement of the CLOCK Replacement , 2005, USENIX ATC, General Track.

[40]  Gu-Yeon Wei,et al.  Thread motion: fine-grained power management for multi-core systems , 2009, ISCA '09.

[41]  Stijn Eyerman,et al.  System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[42]  Natalie D. Enright Jerger,et al.  Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.

[43]  Guy E. Blelloch,et al.  Scheduling threads for constructive cache sharing on CMPs , 2007, SPAA '07.

[44]  T. N. Vijaykumar,et al.  Heat-and-run: leveraging SMT and CMP to manage power density through the operating system , 2004, ASPLOS XI.