Addressing characterization methods for memory contention aware co-scheduling

The ability to precisely predict how memory contention degrades performance when co-scheduling programs is critical for reaching high performance levels in cluster, grid and cloud environments. In this paper we give an overview of state-of-the-art characterization methods for memory-aware (co-)scheduling and compare their performance. We evaluate the prediction accuracy and co-scheduling performance of four methods: one slowdown-based, two cache-contention-based and one based on memory bandwidth usage. Both our regression analysis and our scheduling simulations find that the slowdown-based method, represented by Memgen, performs better than the other methods. The $R^2$ of the linear regression for Memgen's predictions is 0.890, and Memgen's preferred schedules reached 99.53% of the obtainable performance on average. The memory bandwidth usage method performed almost as well as the slowdown-based method. Furthermore, while most prior work promotes characterization based on cache miss rate, we found it to be highly unreliable and on par with random scheduling of programs.
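
As a rough illustration of the regression-based accuracy measure mentioned above, the sketch below computes $R^2$ as the squared Pearson correlation between a method's predicted slowdowns and the slowdowns actually measured when programs are co-scheduled. The function name `r_squared` and the sample values are hypothetical and are not taken from the paper; the paper's exact regression setup may differ.

```python
def r_squared(predicted, measured):
    """Squared Pearson correlation between two equally long sequences.

    For a simple linear regression of `measured` on `predicted`, this
    equals the coefficient of determination R^2.
    """
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Covariance and variances (unnormalized; the factors of n cancel).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("R^2 is undefined for a constant sequence")
    return (cov * cov) / (var_p * var_m)


if __name__ == "__main__":
    # Hypothetical data: predicted vs. measured slowdown factors
    # for five co-scheduled program pairs (illustrative only).
    predicted = [1.05, 1.20, 1.40, 1.10, 1.75]
    measured = [1.08, 1.18, 1.47, 1.12, 1.70]
    print(f"R^2 = {r_squared(predicted, measured):.3f}")
```

A value near 1 means the characterization metric tracks the measured slowdown closely, while a value near 0 means the metric carries little linear information about the slowdown.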
