Simplifying system management through automated forecasting, diagnosis, and configuration tuning

Large-scale networked computing systems are widely deployed to run business-critical applications in environments where changes are frequent. Manual management of these complex systems can be tedious and error-prone. Meanwhile, the high cost of application downtime makes it critical to ensure system availability and reliability. Recent progress in monitoring tools enables system administrators to collect fine-grained data about system activity with low overhead. This data provides valuable information for system management. However, the monitoring data collected from production systems is massive and noisy, which makes it hard for system administrators to use it fully for effective system management. This dissertation describes a data-management platform, called Fa, where system administrators can pose declarative queries over system monitoring data. Fa automatically finds fairly accurate and efficient execution plans for given queries, and returns query results in easy-to-interpret formats. Fa supports three key query types: forecasting queries (for predicting or detecting performance problems), diagnosis queries (for finding the cause of performance problems), and tuning queries (for recommending changes to system configuration to resolve diagnosed problems).

(a) For processing diagnosis queries, Fa constructs problem signatures from system monitoring data to identify recurrent problems and to reuse past diagnostic information. For a rare or new problem, Fa employs an anomaly-based clustering technique to generate performance baselines and to characterize the deviation from those baselines in order to pinpoint root causes. Fa also incorporates an active-learning component that identifies diagnosis queries whose results, if provided or confirmed by system administrators, can be used to update problem signatures and to improve the accuracy and efficiency of processing future queries.

(b) For processing tuning queries to resolve problems caused by system misconfiguration, Fa employs an adaptive sampling algorithm that plans experiments to efficiently identify high-impact configuration parameters and high-performance settings. These experiments gather information that is missing from the monitoring data collected so far and that is required for generating accurate query results.

(c) For both one-time and continuous forecasting queries, Fa automatically searches for efficient execution plans in a large space of plans composed of data-transformation operators as well as synopsis-learning and prediction operators. Forecasting queries can be composed with diagnosis and tuning queries to enable proactive system management that avoids potential problems.

We have evaluated the Fa platform with monitoring data collected from database-backed multitier services, and with synthetic data that models the noisy nature of monitoring data from production systems. Our evaluation shows that Fa's query plan selection and execution strategies provide actionable information for system management automatically, accurately, and efficiently. Critical features such as reliable confidence estimates, robustness to noise, and supporting evidence for query results make Fa a practical and useful platform.
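The idea of reusing past diagnostic information through problem signatures can be illustrated with a minimal sketch. The abstract does not say how Fa represents or matches signatures, so everything below is an assumption made for illustration: a signature is taken to be a vector of per-metric deviation scores, matching uses cosine similarity, and the names `cosine_similarity`, `match_signature`, the threshold value, and the example labels are all hypothetical rather than part of Fa.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def match_signature(observed, known_signatures, threshold=0.9):
    """Return (label, score) of the closest stored signature.

    If no stored signature reaches the (hypothetical) threshold, the label
    is None, which stands for "rare or new problem": the case where a
    different technique would be needed instead of signature reuse.
    """
    best_label, best_score = None, 0.0
    for label, signature in known_signatures.items():
        score = cosine_similarity(observed, signature)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Hypothetical signatures: deviation scores for three monitored metrics.
known = {
    "db_lock_contention": [0.9, 0.1, 0.0],
    "app_memory_leak": [0.1, 0.2, 0.95],
}

recurrent = match_signature([0.85, 0.15, 0.05], known)
novel = match_signature([0.3, 0.9, 0.3], known)
print(recurrent[0])  # label of the matched recurrent problem
print(novel[0])      # None when nothing is similar enough
```

In this sketch a confirmed diagnosis would simply be added to `known` under its label, which loosely mirrors how administrator feedback could update the set of signatures; the actual update mechanism in Fa is not described in the abstract.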
