Paper Information - Predicting Cross-Core Performance Interference on Multicore Processors with Regression Analysis

Despite their widespread adoption in cloud computing, multicore processors are heavily under-utilized in terms of computing resources. To avoid negative and unpredictable interference, co-location of a latency-sensitive application with others on the same multicore processor is disallowed, which leaves many cores idle and keeps machine utilization low. Enabling co-location while still providing QoS guarantees requires predicting the performance interference between co-located applications, which is challenging but important. We observed that the performance degradation of an application can be represented as a piecewise predictor function of the aggregate pressures on shared resources from all cores. Based on this observation, we propose to adopt regression analysis to build a predictor function for an application. Furthermore, the prediction model thus obtained for an application is able to characterize its contentiousness and sensitivity. Validation using a large number of single-threaded and multi-threaded benchmarks and nine real-world datacenter applications on two different platforms shows that our approach is precise, with an average error not exceeding 0.4 percent.
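To make the idea of a piecewise predictor function concrete, the following is a minimal sketch, not the authors' model. It assumes a single shared resource, a hypothetical aggregate pressure obtained by adding the application's own pressure to that of its co-runners, synthetic noise-free degradation data, and exactly two linear segments whose breakpoint is found by exhaustive search. All names (`fit_piecewise`, `predict`, `own_pressure`, `corunner_pressure`) are illustrative assumptions.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b) and the sum of squared errors."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((A @ coef - y) ** 2))
    return coef, sse

def fit_piecewise(x, y, min_points=3):
    """Try every split of the sorted samples and keep the two-segment fit with the lowest error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for i in range(min_points, len(x) - min_points + 1):
        left, sse_left = fit_line(x[:i], y[:i])
        right, sse_right = fit_line(x[i:], y[i:])
        total = sse_left + sse_right
        if best is None or total < best[0]:
            split = (x[i - 1] + x[i]) / 2  # breakpoint halfway between neighbouring samples
            best = (total, split, left, right)
    _, split, left, right = best
    return split, left, right

def predict(model, pressure):
    """Evaluate the sub-function whose sub-domain contains the given aggregate pressure."""
    split, left, right = model
    a, b = left if pressure < split else right
    return a * pressure + b

# Synthetic training data (assumed, for illustration only): the application's own
# pressure is fixed, the co-runners' pressure varies, and the two are summed.
own_pressure = 5.0
corunner_pressure = np.linspace(0.0, 15.0, 31)
aggregate = own_pressure + corunner_pressure

# Assumed ground truth: mild degradation below a pressure of 10, steep above it.
degradation = np.where(aggregate < 10.0,
                       0.02 * aggregate,
                       0.2 + 0.3 * (aggregate - 10.0))

model = fit_piecewise(aggregate, degradation)
print("breakpoint:", model[0])
print("predicted degradation at pressure 15:", predict(model, 15.0))
```

The predictor takes only the aggregate pressure as input, so it can be fitted once per application and then queried for any candidate set of co-runners. A realistic model would cover several shared resources and measured rather than synthetic data.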

Xiaobing Feng | Jingling Xue | Huimin Cui | Jiacheng Zhao
