milliScope: A Fine-Grained Monitoring Framework for Performance Debugging of n-Tier Web Services

Modern distributed systems are often treated as black boxes, which greatly limits the ability to understand their behavior at the level of detail needed to diagnose some of the most important types of performance problems. Researchers have recently found abnormal response time delays, one to two orders of magnitude longer than the average response time, that last only for short periods and cause economic loss for service providers. The very short bottlenecks behind these delays are hard to detect because they are short-lived and have a variety of possible causes. In this paper, we propose milliScope (mScope), the first millisecond-granularity, software-based resource and event monitoring framework for distributed systems that achieves both low overhead at high sampling frequency and accuracy comparable to that of firmware-based monitoring tools. More specifically, milliScope is a fine-grained monitoring framework that coordinates multiple mScopeMonitors for event and resource monitoring in order to reconstruct the flow of each client request and profile execution performance in a distributed system. We use resource mScopeMonitors for system resource monitoring, and we develop our own event mScopeMonitors to identify execution boundaries in a lightweight, precise, and systematic manner. The resulting monitoring logs, which come in arbitrary formats, are enriched semantically and syntactically by our multistage data transformation tool, mScopeDataTransformer, which unifies the diverse logs into a dynamic data warehouse, mScopeDB, for advanced analysis. We present several illustrative scenarios, using a representative web application benchmark (RUBBoS), in which milliScope successfully diagnoses response time anomalies caused by very short bottlenecks.
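As a rough illustration of the idea of reconstructing each request's flow from per-tier event records and then flagging response times that are an order of magnitude above normal, consider the minimal Python sketch below. It is not milliScope's implementation: the record layout, the tier names, the sample data, and the helper functions are all hypothetical, and the median is used as the baseline (an assumption made here so that the outliers themselves do not inflate the reference value).

```python
from collections import defaultdict
from statistics import median

# Hypothetical event records, one per request per tier:
# (request_id, tier, arrival_ms, departure_ms). Not a real mScope log format.
EVENTS = [
    ("r1", "web", 0, 12), ("r1", "app", 2, 10), ("r1", "db", 4, 8),
    ("r2", "web", 5, 18), ("r2", "app", 7, 16), ("r2", "db", 9, 14),
    ("r3", "web", 10, 410), ("r3", "app", 12, 408), ("r3", "db", 14, 20),
]


def reconstruct_flows(events):
    """Group per-tier records by request ID to rebuild each request's flow."""
    flows = defaultdict(dict)
    for request_id, tier, arrival, departure in events:
        flows[request_id][tier] = (arrival, departure)
    return dict(flows)


def response_times(flows, front_tier="web"):
    """End-to-end response time, measured at the front tier (assumed 'web')."""
    return {
        request_id: spans[front_tier][1] - spans[front_tier][0]
        for request_id, spans in flows.items()
    }


def find_anomalies(times, factor=10):
    """Return request IDs whose response time is at least factor x the median."""
    baseline = median(times.values())
    return sorted(rid for rid, t in times.items() if t >= factor * baseline)


flows = reconstruct_flows(EVENTS)
times = response_times(flows)
print(find_anomalies(times))
```

In this toy data, the per-tier spans of the flagged request show that almost all of its time is spent between the app tier's arrival and departure while the database tier finishes quickly, which is the kind of per-tier breakdown that fine-grained event monitoring is meant to expose.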
