Paper Information - Rusty: Runtime Interference-Aware Predictive Monitoring for Modern Multi-Tenant Systems

Rusty: Runtime Interference-Aware Predictive Monitoring for Modern Multi-Tenant Systems

Modern microservice and container-based cloud-native applications have leveraged multi-tenancy as a first-class system design concern. The increasing number of services and workloads co-located in server facilities stresses resource availability and system capability in unconventional and unpredictable ways. To manage resources efficiently in such dynamic environments, run-time observability and forecasting are required to capture workload sensitivities under the differing interference effects that each co-location scenario produces. While several research efforts on interference-aware performance modelling have emerged, they are usually applied in a very coarse-grained manner, e.g., estimating the overall performance degradation of an application, and thus fail to effectively quantify, predict, or provide educated insights on the impact of continuous runtime interference on per-resource allocations. In this paper, we present Rusty, a predictive monitoring system that leverages Long Short-Term Memory networks to enable fast and accurate runtime forecasting of key performance metrics and resource stresses of cloud-native applications under interference. We evaluate Rusty under a diverse set of interference scenarios for a wide range of representative cloud workloads, showing that Rusty i) achieves extremely high prediction accuracy, with an average $R^2$ value of 0.98, ii) enables very deep prediction horizons while retaining high accuracy, e.g., $R^2$ of around 0.99 for a horizon of 1 sec ahead and around 0.94 for a horizon of 5 sec ahead, while iii) satisfying, at the same time, the strict latency constraints required to make Rusty practical for continuous predictive monitoring at runtime.
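The accuracy figures above are reported as $R^2$ at a given prediction horizon. As a minimal sketch of how such a score could be computed, the Python below evaluates $R^2 = 1 - SS_{res}/SS_{tot}$ for forecasts made a fixed number of steps ahead. It is not Rusty's implementation: the forecaster is a naive persistence baseline standing in for the LSTM, and the names `r2_score`, `persistence_forecast`, and the sample series are hypothetical, introduced only for illustration.

```python
from typing import List, Sequence, Tuple


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    return 1.0 - ss_res / ss_tot


def persistence_forecast(
    series: Sequence[float], horizon: int
) -> Tuple[List[float], List[float]]:
    """Pair each target with a naive forecast made `horizon` steps earlier.

    Assumption: the forecast is simply the last observed value. A learned
    model (e.g., an LSTM) would replace this step.
    """
    if not 1 <= horizon < len(series):
        raise ValueError("horizon must be in [1, len(series) - 1]")
    targets = list(series[horizon:])   # values observed `horizon` steps later
    preds = list(series[:-horizon])    # last value known at forecast time
    return targets, preds


if __name__ == "__main__":
    # Hypothetical metric trace (e.g., one sample per monitoring interval).
    trace = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    for h in (1, 2):
        y_true, y_pred = persistence_forecast(trace, h)
        print(f"horizon={h} steps, R^2={r2_score(y_true, y_pred):.2f}")
```

On this steadily rising trace the persistence baseline scores worse as the horizon grows, which is the same accuracy-versus-horizon trade-off that the 1 sec and 5 sec figures above describe for the learned model.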
