Response Time Reliability in Cloud Environments: An Empirical Study of n-Tier Applications at High Resource Utilization

When running mission-critical web-facing applications (e.g., electronic commerce) in cloud environments, predictable response time, e.g., specified as service level agreements (SLA), is a major performance reliability requirement. Through extensive measurements of n-tier application benchmarks in a cloud environment, we study three factors that significantly impact the application response time predictability: bursty workloads (typical of web-facing applications), soft resource management strategies (e.g., global thread pool or local thread pool), and bursts in system software consumption of hardware resources (e.g., Java Virtual Machine garbage collection). Using a set of profit-based performance criteria derived from typical SLAs, we show that response time reliability is brittle, with large response time variations (order of several seconds) depending on each one of those factors. For example, for the same workload and hardware platform, modest increases in workload burstiness may result in profit drops of more than 50%. Our results show that profitbased performance criteria may contribute significantly to the successful delimitation of performance unreliability boundaries and thus support effective management of clouds.