Predicting Cross-Core Performance Interference on Multicore Processors with Regression Analysis

Despite their widespread adoption in cloud computing, multicore processors are heavily under-utilized in terms of computing resources. To avoid the potential for negative and unpredictable interference, co-location of a latency-sensitive application with others on the same multicore processor is disallowed, leaving many cores idle and causing low machine utilization. To enable co-location while providing QoS guarantees, it is challenging but important to predict performance interference between co-located applications. We observed that the performance degradation of an application can be represented as a piecewise predictor function of the aggregate pressures on shared resources from all cores. Based on this observation, we propose to adopt regression analysis to build a predictor function for an application. Furthermore, the prediction model thus obtained for an application is able to characterize its contentiousness and sensitivity. Validation using a large number of single-threaded and multi-threaded benchmarks and nine real-world datacenter applications on two different platforms shows that our approach is also precise, with an average error not exceeding 0.4 percent.

[1]  Josep Torrellas,et al.  Speculative synchronization: applying thread-level speculation to explicitly parallel applications , 2002, ASPLOS X.

[2]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[3]  Kevin Skadron,et al.  Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Björn Lisper,et al.  Data caches in multitasking hard real-time systems , 2003, RTSS 2003. 24th IEEE Real-Time Systems Symposium, 2003.

[5]  Björn Lisper,et al.  Data cache locking for tight timing calculations , 2007, TECS.

[6]  Jianjun Li,et al.  Providing fairness on shared-memory multiprocessors via process scheduling , 2012, SIGMETRICS '12.

[7]  Mary Lou Soffa,et al.  Contention aware execution: online contention detection and response , 2010, CGO '10.

[8]  Chen Ding,et al.  Defensive loop tiling for shared cache , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[9]  Yang Yang,et al.  Automatic Library Generation for BLAS3 on GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[10]  Yale N. Patt,et al.  Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs , 2008, ASPLOS.

[11]  Angela C. Sodan,et al.  Predicting cache needs and cache sensitivity for applications in cloud computing on CMP servers with configurable caches , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[12]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[13]  Xiaobing Feng,et al.  An empirical model for predicting cross-core performance interference on multicore processors , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[14]  Lingjia Tang,et al.  Compiling for niceness: mitigating contention for QoS in warehouse scale computers , 2012, CGO '12.

[15]  Francisco J. Cazorla,et al.  Multicore Resource Management , 2008, IEEE Micro.

[16]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[17]  Jingling Xue,et al.  Query-directed adaptive heap cloning for optimizing compilers , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[18]  David A. Wood,et al.  IPC Considered Harmful for Multiprocessor Workloads , 2006, IEEE Micro.

[19]  S. Kim,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[20]  Lingjia Tang,et al.  Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures , 2011, EXADAPT '11.

[21]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[22]  Hongtao Yu,et al.  Level by level: making flow- and context-sensitive pointer analysis scalable for millions of lines of code , 2010, CGO '10.

[23]  Lingjia Tang,et al.  The impact of memory subsystem resource sharing on datacenter applications , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[24]  Xiao Zhang,et al.  Towards practical page coloring-based multicore cache management , 2009, EuroSys '09.

[25]  Christian Bienia,et al.  Benchmarking modern multiprocessors , 2011 .

[26]  Nathan Clark,et al.  Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications , 2010, ISCA.

[27]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[28]  David Eklov,et al.  Bandwidth Bandit: Quantitative characterization of memory contention , 2012, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[29]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[30]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[31]  Mary Lou Soffa,et al.  DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[32]  Jingling Xue,et al.  On-demand dynamic summary-based points-to analysis , 2012, CGO '12.

[33]  Tong Li,et al.  Using OS Observations to Improve Performance in Multicore Systems , 2008, IEEE Micro.

[34]  Xipeng Shen,et al.  Combining Locality Analysis with Online Proactive Job Co-scheduling in Chip Multiprocessors , 2010, HiPEAC.

[35]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[36]  Yuxiong He,et al.  The Cilkview scalability analyzer , 2010, SPAA '10.

[37]  Alexei Alexandrov Parallelization Made Easier with Intel PerformanceTuning Utility , 2007 .

[38]  Lingjia Tang,et al.  Directly characterizing cross core interference through contention synthesis , 2011, HiPEAC.

[39]  Maged M. Michael Hazard pointers: safe memory reclamation for lock-free objects , 2004, IEEE Transactions on Parallel and Distributed Systems.

[40]  Mahmut T. Kandemir,et al.  A case for integrated processor-cache partitioning in chip multiprocessors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[41]  Lingjia Tang,et al.  SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[42]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[43]  Dongrui Fan,et al.  Extendable pattern-oriented optimization directives , 2012, International Symposium on Code Generation and Optimization (CGO 2011).

[44]  Aamer Jaleel,et al.  Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[45]  Xipeng Shen,et al.  Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? , 2010, PPoPP '10.

[46]  Mikko H. Lipasti,et al.  Redeeming IPC as a performance metric for multithreaded programs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[47]  Lingjia Tang,et al.  Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[48]  Lingjia Tang,et al.  Whare-map: heterogeneity in "homogeneous" warehouse-scale computers , 2013, ISCA.

[49]  I. Jolliffe Principal Components in Regression Analysis , 1986 .

[50]  Jie Chen,et al.  Analysis and approximation of optimal co-scheduling on Chip Multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[51]  Luiz André Barroso,et al.  The Case for Energy-Proportional Computing , 2007, Computer.

[52]  Matthias S. Müller,et al.  Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[53]  Pen-Chung Yew,et al.  On mitigating memory bandwidth contention through bandwidth-aware scheduling , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[54]  Donald Eugene. Farrar,et al.  Multicollinearity in Regression Analysis; the Problem Revisited , 2011 .

[55]  Michael D. Smith,et al.  Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).