DIADS: Addressing the "My-Problem-or-Yours" Syndrome with Integrated SAN and Database Diagnosis

We present DIADS, an integrated DIAgnosis tool for Databases and Storage area networks (SANs). Existing diagnosis tools in this domain have a database-only (e.g., [11]) or SAN-only (e.g., [28]) focus. DIADS is a first-of-a-kind framework based on a careful integration of information from the database and SAN subsystems; it is not a simple concatenation of database-only and SAN-only modules. This approach not only increases the accuracy of diagnosis, but also leads to significant improvements in efficiency. DIADS uses a novel combination of non-intrusive machine learning techniques (e.g., Kernel Density Estimation) and domain knowledge encoded in a new symptoms database design. The machine learning component provides core techniques for problem diagnosis from monitoring data, and the domain knowledge acts as checks and balances that guide the diagnosis in the right direction. This unique system design enables DIADS to function effectively even in the presence of multiple concurrent problems and the noisy data prevalent in production environments. We demonstrate the efficacy of our approach through a detailed experimental evaluation of DIADS implemented on a real data center testbed with PostgreSQL databases and an enterprise SAN.
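The abstract names Kernel Density Estimation as one of the machine learning techniques but does not say how DIADS applies it. As an illustration only, the sketch below assumes one plausible use: fit a Gaussian KDE to a monitored metric collected during satisfactory runs, then score a new observation by the estimated probability that a satisfactory run would produce a value at or below it. The function names, the sample data, and the bandwidth rule (Silverman's rule of thumb) are assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_kde_cdf(samples, x, bandwidth=None):
    """Estimate P(X <= x) from samples with a Gaussian kernel density estimate.

    The CDF of a Gaussian KDE is the average of the per-sample normal CDFs,
    each centered on one sample with standard deviation `bandwidth`.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    if bandwidth is None:
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / (n - 1)
        std = math.sqrt(var)
        # Silverman's rule of thumb (an assumption of this sketch);
        # fall back to 1.0 when all samples are identical.
        bandwidth = 1.06 * std * n ** (-0.2) if std > 0 else 1.0
    scale = bandwidth * math.sqrt(2.0)
    total = sum(0.5 * (1.0 + math.erf((x - s) / scale)) for s in samples)
    return total / n


def anomaly_score(satisfactory_values, observed, bandwidth=None):
    """Hypothetical score in [0, 1]; values near 1 mean `observed` is
    unusually high relative to the satisfactory runs."""
    return gaussian_kde_cdf(satisfactory_values, observed, bandwidth)


if __name__ == "__main__":
    # Made-up running times (seconds) of one query operator in satisfactory runs.
    satisfactory = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
    print(anomaly_score(satisfactory, 10.0))  # mid-range score: looks normal
    print(anomaly_score(satisfactory, 14.0))  # close to 1: looks anomalous
```

A score-based check like this needs no instrumentation beyond the monitoring data already collected, which fits the non-intrusive setting the abstract describes. Any threshold for flagging a metric would be a further tuning choice.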

[1] Graham Wood, et al. Automatic Performance Diagnosis and Tuning in Oracle, 2005, CIDR.

[2] Surajit Chaudhuri, et al. SQLCM: a continuous monitoring framework for relational database engines, 2004, Proceedings. 20th International Conference on Data Engineering.

[3] Gerhard Weikum, et al. Self-tuning Database Technology and Information Services: from Wishful Thinking to Viable Engineering, 2002, VLDB.

[4] D. Ohsie, et al. High speed and robust event correlation, 1996, IEEE Commun. Mag.

[5] John Dunagan, et al. Why did my pc suddenly slow down, 2007.

[6] Jeffrey S. Chase, et al. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control, 2004, OSDI.

[7] Gregory R. Ganger, et al. Modeling the relative fitness of storage, 2007, SIGMETRICS '07.

[8] Jennifer Widom, et al. Database Systems: The Complete Book, 2001.

[9] Ming Zhong, et al. I/O system performance debugging using model-driven anomaly characterization, 2005, FAST'05.

[10] Julio César López-Hernández, et al. Stardust: tracking activity in a distributed storage system, 2006, SIGMETRICS '06/Performance '06.

[11] Aameek Singh, et al. Why Did My Query Slow Down, 2009, CIDR.

[12] J. Pearl. Causality: Models, Reasoning and Inference, 2000.

[13] Anil K. Goel, et al. Towards Adaptive Costing of Database Access Methods, 2007, 2007 IEEE 23rd International Conference on Data Engineering Workshop.

[14] Erez Zadok, et al. Tracefs: A File System to Trace Them All, 2004, FAST.

[15] Sandeep Uttamchandani, et al. Genesis: A Scalable Self-Evolving Performance Management Framework for Storage Systems, 2006, 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06).

[16] Nikolai Joukov, et al. Operating system profiling via latency analysis, 2006, OSDI '06.

[17] Alan Jay Smith, et al. A File System Tracing Package for Berkeley UNIX, 1985.

[18] Armando Fox, et al. Capturing, indexing, clustering, and retrieving system history, 2005, SOSP '05.

[19] Frederick Reiss, et al. A characterization of the sensitivity of query optimization to storage access cost parameters, 2003, SIGMOD '03.

[20] Benoît Dageville, et al. Automatic SQL Tuning in Oracle 10g, 2004, VLDB.

[21] David A. Patterson, et al. Path-Based Failure and Evolution Management, 2004, NSDI.

[22] Armando Fox, et al. Ensembles of models for automated diagnosis of system performance problems, 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[23] David L. Cohn, et al. Autonomic Computing, 2003, ISADS.

[24] E. Parzen. On Estimation of a Probability Density Function and Mode, 1962.

[25] Marcos K. Aguilera, et al. Performance debugging for distributed systems of black boxes, 2003, SOSP '03.

[26] Richard Murch, et al. Autonomic Computing, 2004.

[27] Kamesh Munagala, et al. Fa: A System for Automating Failure Diagnosis, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[28] Helen J. Wang, et al. Automatic Misconfiguration Troubleshooting with PeerPressure, 2004, OSDI.

[29] Chetan Gupta, et al. Automatic Workload Management for Enterprise Data Warehouses, 2008, IEEE Data Eng. Bull.