Lowering the volatility: a practical cache allocation prediction and stability-oriented co-runner scheduling algorithms

Accurate, quantitative analysis of cache behavior on a chip multiprocessor (CMP) has long been a challenging problem. So far there has been no practical way to predict the cache allocation, i.e., the allocated cache size, of a running program. For many applications, especially those that interact heavily with users, cache allocation should be estimated with high accuracy, because its variation is closely tied to the stability of system performance, which is important to the efficient operation of servers and strongly influences user experience. To this end, this paper proposes an accurate prediction model for the last-level cache (LLC) allocation of co-runners. Using the predicted cache allocation, we further implement a performance-stability-oriented co-runner scheduling algorithm that aims to maximize the number of co-runners running in a performance-stable state and to minimize the performance variation of the unstable ones. We demonstrate that the proposed prediction algorithm achieves high accuracy, with an average error of 5.7%, and that the co-runner scheduling algorithm finds the optimal solution under the specified target with a time complexity of O(n).
