DBSherlock: A Performance Diagnostic Tool for Transactional Databases

Running an online transaction processing (OLTP) system is one of the most daunting tasks required of database administrators (DBAs). As businesses rely on OLTP databases to support their mission-critical and real-time applications, poor database performance directly impacts their revenue and user experience. As a result, DBAs constantly monitor, diagnose, and rectify any performance decays. Unfortunately, the manual process of debugging and diagnosing OLTP performance problems is extremely tedious and non-trivial. Rather than being caused by a single slow query, performance problems in OLTP databases are often due to a large number of concurrent and competing transactions adding up to compounded, non-linear effects that are difficult to isolate. Sudden changes in request volume, transactional patterns, network traffic, or data distribution can cause previously abundant resources to become scarce, and the performance to plummet. This paper presents a practical tool for assisting DBAs in quickly and reliably diagnosing performance problems in an OLTP database. By analyzing hundreds of statistics and configurations collected over the lifetime of the system, our algorithm quickly identifies a small set of potential causes and presents them to the DBA. The root-cause established by the DBA is reincorporated into our algorithm as a new causal model to improve future diagnoses. Our experiments show that this algorithm is substantially more accurate than the state-of-the-art algorithm in finding correct explanations.

[1]  Herodotos Herodotou,et al.  Xplus , 2010, Proc. VLDB Endow..

[2]  Graham Wood,et al.  Automatic Performance Diagnosis and Tuning in Oracle , 2005, CIDR.

[3]  Andreas Ritter,et al.  Structural Equations With Latent Variables , 2016 .

[4]  Dan Suciu,et al.  Bringing Provenance to Its Full Potential Using Causal Reasoning , 2011, TaPP.

[5]  Li Wei,et al.  HOT: Hypergraph-Based Outlier Test for Categorical Data , 2003, PAKDD.

[6]  Joseph Y. Halpern,et al.  Causes and Explanations: A Structural-Model Approach. Part I: Causes , 2000, The British Journal for the Philosophy of Science.

[7]  Carlo Curino,et al.  OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases , 2013, Proc. VLDB Endow..

[8]  Lei Cao,et al.  Interactive Outlier Exploration in Big Data Streams , 2014, Proc. VLDB Endow..

[9]  Qi Zhao,et al.  Towards automated performance diagnosis in a large IPTV network , 2009, SIGCOMM '09.

[10]  James Cheney,et al.  Provenance in Databases: Why, How, and Where , 2009, Found. Trends Databases.

[11]  Manish Kumar,et al.  PerfAugur: Robust diagnostics for performance anomalies in cloud services , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[12]  Christopher Ré,et al.  Automatic Optimization for MapReduce Programs , 2011, Proc. VLDB Endow..

[13]  Joseph Y. Halpern,et al.  Causes and explanations: A structural-model approach , 2000 .

[14]  Dan Suciu,et al.  PerfXplain: Debugging MapReduce Job Performance , 2012, Proc. VLDB Endow..

[15]  Darcy G. Benoit,et al.  Automatic Diagnosis of Performance Problems in Database Management Systems , 2005, Second International Conference on Autonomic Computing (ICAC'05).

[16]  T. Asano,et al.  ENTROPY , RELATIVE ENTROPY , AND MUTUAL INFORMATION , 2008 .

[17]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[18]  Shivnath Babu,et al.  Tuning Database Configuration Parameters with iTuned , 2009, Proc. VLDB Endow..

[19]  Carlo Curino,et al.  DBSeer: Resource and Performance Prediction for Building a Next Generation Database Cloud , 2013, CIDR.

[20]  Ippokratis Pandis,et al.  TPC-E vs. TPC-C: characterizing the new TPC-E benchmark via an I/O comparison study , 2011, SGMD.

[21]  Sunita Sarawagi,et al.  Explaining Differences in Multidimensional Aggregates , 1999, VLDB.

[22]  Joseph Y. Halpern,et al.  Causes and Explanations: A Structural-Model Approach. Part II: Explanations , 2001, The British Journal for the Philosophy of Science.

[23]  Hongjun Lu,et al.  Finding centric local outliers in categorical/numerical spaces , 2006, Knowledge and Information Systems.

[24]  Benoît Dageville,et al.  Self-Tuning for SQL Performance in Oracle Database 11g , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[25]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[26]  Eamonn J. Keogh,et al.  Finding Unusual Medical Time-Series Subsequences: Algorithms and Applications , 2006, IEEE Transactions on Information Technology in Biomedicine.

[27]  Aameek Singh,et al.  Why Did My Query Slow Down , 2009, CIDR.

[28]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[29]  Barzan Mozafari,et al.  CliffGuard: A Principled Framework for Finding Robust Database Designs , 2015, SIGMOD Conference.

[30]  Samuel Madden,et al.  Scorpion: Explaining Away Outliers in Aggregate Queries , 2013, Proc. VLDB Endow..

[31]  D. Francis An introduction to structural equation models. , 1988, Journal of clinical and experimental neuropsychology.

[32]  Alfons Kemper,et al.  Adaptive quality of service management for enterprise services , 2008, TWEB.

[33]  Gjergji Kasneci,et al.  Temporal Anomaly Detection in Business Processes , 2014, BPM.

[34]  Li Wei,et al.  Assumption-Free Anomaly Detection in Time Series , 2005, SSDBM.

[35]  Haixun Wang,et al.  Adaptive system anomaly prediction for large-scale hosting infrastructures , 2010, PODC.

[36]  Barzan Mozafari,et al.  DBSeer: Pain-free Database Administration through Workload Intelligence , 2015, Proc. VLDB Endow..

[37]  Shivnath Babu,et al.  Towards automatic optimization of MapReduce programs , 2010, SoCC '10.

[38]  Jian Li,et al.  Sensitivity analysis and explanations for robust query evaluation in probabilistic databases , 2011, SIGMOD '11.

[39]  Alex Delis,et al.  Outlier-Aware Data Aggregation in Sensor Networks , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[40]  Kamesh Munagala,et al.  Fa: A System for Automating Failure Diagnosis , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[41]  Carlo Curino,et al.  Performance and resource modeling in highly-concurrent OLTP workloads , 2013, SIGMOD '13.