Abstracting Multi-Core Topologies with MCTOP

Portability and efficiency are usually antagonists in multi-core computing. In order to develop efficient code, one needs to take into account the topology of the target multi-cores (e.g., for locality). This clearly hampers code portability. In this paper, we show that you can have the cake and eat it too. We introduce MCTOP, an abstraction of multi-core topologies augmented with important low-level hardware information, such as memory bandwidths and communication latencies. We show how to automatically generate MCTOP using libmctop, our library that leverages the determinism of cache-coherence protocols to infer the topology of multi-cores using only latency measurements. MCTOP enables developers to accurately and portably define high-level performance optimization policies. We illustrate several such policies through four examples: (i-ii) thread placement in OpenMP and in a MapReduce library, (iii) a topology-aware mergesort algorithm, as well as (iv) automatic backoff schemes for locks. We illustrate the portability of these optimizations on five processors from Intel, AMD, and Oracle, with low effort.

[1]  Rachid Guerraoui,et al.  Unlocking Energy , 2016, USENIX Annual Technical Conference.

[2]  I-Hsin Chung,et al.  Active Harmony: Towards Automated Performance Tuning , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[3]  Marc Shapiro,et al.  A study of the scalability of stop-the-world garbage collectors on multicores , 2013, ASPLOS '13.

[4]  John M. Mellor-Crummey,et al.  Contention-conscious, locality-preserving locks , 2016, PPoPP.

[5]  Tudor David,et al.  Everything you always wanted to know about synchronization but were afraid to ask , 2013, SOSP.

[6]  Takeshi Ogasawara NUMA-aware memory manager with dominant-thread-based copying GC , 2009, OOPSLA 2009.

[7]  Robert Morris,et al.  Optimizing MapReduce for Multicore Architectures , 2010 .

[8]  Keshav Pingali,et al.  Automatic measurement of memory hierarchy parameters , 2005, SIGMETRICS '05.

[9]  Michael Stumm,et al.  Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.

[10]  Muthu Dayalan,et al.  MapReduce : Simplified Data Processing on Large Cluster , 2018 .

[11]  Manuel Prieto,et al.  A comprehensive scheduler for asymmetric multicore systems , 2010, EuroSys '10.

[12]  Janak H. Patel,et al.  A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[13]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[14]  Arthur Charguéraud,et al.  Scheduling parallel programs by work stealing with private deques , 2013, PPoPP '13.

[15]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[16]  Sabela Ramos,et al.  Machine-Aware Atomic Broadcast Trees for Multicores , 2016, OSDI.

[17]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[18]  Eitan Frachtenberg,et al.  Power and performance evaluation of Memcached on the TILEPro64 architecture , 2012, Sustain. Comput. Informatics Syst..

[19]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[20]  Michael Stumm,et al.  Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system , 1999, OSDI '99.

[21]  Wolfgang E. Nagel,et al.  Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[23]  Norman May,et al.  Scaling Up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA-aware Data and Task Placement , 2015, Proc. VLDB Endow..

[24]  Maged M. Michael,et al.  Simple, fast, and practical non-blocking and blocking concurrent queue algorithms , 1996, PODC '96.

[25]  Changwoo Min,et al.  Scalability in the Clouds!: A Myth or Reality? , 2015, APSys.

[26]  Michael Garland,et al.  Architecture-Adaptive Code Variant Tuning , 2016, ASPLOS.

[27]  Vivien Quéma,et al.  Multicore Locks: The Case Is Not Closed Yet , 2016, USENIX Annual Technical Conference.

[28]  Kunle Olukotun,et al.  Green-Marl: a DSL for easy and efficient graph analysis , 2012, ASPLOS XVII.

[29]  Vivien Quéma,et al.  Thread and Memory Placement on NUMA Systems: Asymmetry Matters , 2015, USENIX Annual Technical Conference.

[30]  Gerhard Wellein,et al.  LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments , 2010, 2010 39th International Conference on Parallel Processing Workshops.

[31]  Thomas R. Gross,et al.  A library for portable and composable data locality optimizations for NUMA systems , 2015, PPOPP.

[32]  Adrian Schüpbach,et al.  The multikernel: a new OS architecture for scalable multicore systems , 2009, SOSP '09.

[33]  Guillaume Mercier,et al.  hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[34]  Timothy L. Harris,et al.  Callisto-RTS: Fine-Grain Parallel Loops , 2015, USENIX Annual Technical Conference.

[35]  Timothy Roscoe,et al.  Arrakis , 2014, OSDI.

[36]  Dheeraj Reddy,et al.  Bias scheduling in heterogeneous multi-core architectures , 2010, EuroSys '10.

[37]  Gustavo Alonso,et al.  Deployment of Query Plans on Multicores , 2014, Proc. VLDB Endow..

[38]  Ippokratis Pandis,et al.  OLTP on Hardware Islands , 2012, Proc. VLDB Endow..

[39]  Peter Sanders,et al.  MCSTL: the multi-core standard template library , 2007, PPOPP.

[40]  Frank Bellosa,et al.  Resource-conscious scheduling for energy efficiency on multicore processors , 2010, EuroSys '10.

[41]  David J. Brown,et al.  Toward energy-efficient computing , 2010, CACM.

[42]  Nir Shavit,et al.  NUMA-aware reader-writer locks , 2013, PPoPP '13.

[43]  Kevin M. Lepak,et al.  Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.

[44]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[45]  Thomas E. Anderson,et al.  The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors , 1990, IEEE Trans. Parallel Distributed Syst..

[46]  Anant Agarwal,et al.  Factored operating systems (fos): the case for a scalable operating system for multicores , 2009, OPSR.

[47]  Tong Li,et al.  Efficient operating system scheduling for performance-asymmetric multi-core architectures , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[48]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[49]  Hyeontaek Lim,et al.  MICA: A Holistic Approach to Fast In-Memory Key-Value Storage , 2014, NSDI.

[50]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[51]  Adrian Schüpbach,et al.  Your computer is already a distributed system. Why isn't your OS? , 2009, HotOS.

[52]  Robert Tappan Morris,et al.  An Analysis of Linux Scalability to Many Cores , 2010, OSDI.

[53]  Timothy Roscoe,et al.  Decoupling Cores, Kernels, and Operating Systems , 2014, OSDI.

[54]  Nhan Nguyen,et al.  NumaGiC: a Garbage Collector for Big Data on Big NUMA Machines , 2015, ASPLOS.

[55]  Takeshi Ogasawara NUMA-aware memory manager with dominant-thread-based copying GC , 2009, OOPSLA.

[56]  Pradeep Dubey,et al.  Efficient implementation of sorting on multi-core SIMD CPU architecture , 2008, Proc. VLDB Endow..

[57]  Adrian Schüpbach,et al.  Embracing diversity in the Barrelfish manycore operating system , 2008 .

[58]  A. Agarwal,et al.  Adaptive backoff synchronization techniques , 1989, ISCA '89.

[59]  Christoforos E. Kozyrakis,et al.  IX: A Protected Dataplane Operating System for High Throughput and Low Latency , 2014, OSDI.

[60]  John M. Mellor-Crummey,et al.  High performance locks for multi-level NUMA systems , 2015, PPoPP.

[61]  Yang Zhang,et al.  Corey: An Operating System for Many Cores , 2008, OSDI.

[62]  Eddie Kohler,et al.  Fast Databases with Fast Durability and Recovery Through Multicore Parallelism , 2014, OSDI.

[63]  Babak Falsafi,et al.  Shore-MT: a scalable storage manager for the multicore era , 2009, EDBT '09.

[64]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[65]  Hiroshi Inoue,et al.  SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures , 2015, Proc. VLDB Endow..