Database query optimisation based on measures of regret

The query optimiser in a database management system (DBMS) is responsible for �nding a good order in which to execute the operators in a given query. However, in practice the query optimiser does not usually guarantee to �nd the best plan. This is often due to the non-availability of precise statistical data or inaccurate assumptions made by the optimiser. In this thesis we propose a robust approach to logical query optimisation that takes into account the unreliability in database statistics during the optimisation process. In particular, we study the ordering problem for selection operators and for join operators, where selectivities are modelled as intervals rather than exact values. As a measure of optimality, we use a concept from decision theory called minmax regret optimisation (MRO). When using interval selectivities, the decision problem for selection operator ordering turns out to be NP-hard. After investigating properties of the problem and identifying special cases which can be solved in polynomial time, we develop a novel heuristic for solving the general selection ordering problem in polynomial time. Experimental evaluation of the heuristic using synthetic data, the Star Schema Benchmark and real-world data sets shows that it outperforms other heuristics (which take an optimistic, pessimistic or midpoint approach) and also produces plans whose regret is on average very close to optimal. The general join ordering problem is known to be NP-hard, even for exact selectivities. So, for interval selectivities, we restrict our investigation to sets of join operators which form a chain and to plans that correspond to left-deep join trees. We investigate properties of the problem and use these, along with ideas from the selection ordering heuristic and other algorithms in the literature, to develop a polynomial-time heuristic tailored for the join ordering problem. Experimental evaluation of the heuristic shows that, once again, it performs better than the optimistic, pessimistic and midpoint heuristics. In addition, the results show that the heuristic produces plans whose regret is on average even closer to the optimal than for selection ordering.

[1]  Noga Alon,et al.  Approximation schemes for scheduling , 1997, SODA '97.

[2]  Neoklis Polyzotis,et al.  Statistical synopses for graph-structured XML databases , 2002, SIGMOD '02.

[3]  Vladimir Zadorozhny,et al.  Efficient evaluation of queries in a mediator for WebSources , 2002, SIGMOD '02.

[4]  Christian S. Jensen,et al.  Efficiently adapting graphical models for selectivity estimation , 2012, The VLDB Journal.

[5]  Pedro Furtado,et al.  Summary grids: building accurate multidimensional histograms , 1999, Proceedings. 6th International Conference on Advanced Systems for Advanced Applications.

[6]  Serge A. Plotkin Competitive Routing of Virtual Circuits in ATM Networks , 1995, IEEE J. Sel. Areas Commun..

[7]  Chung-Cheng Lu,et al.  Robust scheduling on a single machine to minimize total flow time , 2012, Comput. Oper. Res..

[8]  Jian Yang,et al.  On the Robust Single Machine Scheduling Problem , 2002, J. Comb. Optim..

[9]  M. R. Garey,et al.  Optimal task sequencing with precedence constraints , 1973, Discrete Mathematics.

[10]  Carolyn E. Begg,et al.  Database Systems: A Practical Approach to Design, Implementation and Management , 1998 .

[11]  David J. DeWitt,et al.  Complex query processing in multiprocessor database machines , 1990 .

[12]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[13]  David J. DeWitt,et al.  Efficient mid-query re-optimization of sub-optimal query execution plans , 1998, SIGMOD '98.

[14]  Mark S. Daskin,et al.  α-reliable p-minimax regret: a new model for strategic facility location modeling , 1997 .

[15]  Clyde L. Monma,et al.  Sequencing with Series-Parallel Precedence Constraints , 1979, Math. Oper. Res..

[16]  Alfredo Candia-Véjar,et al.  Minmax regret combinatorial optimization problems: an Algorithmic Perspective , 2011, RAIRO Oper. Res..

[17]  Carlo Zaniolo,et al.  Optimization of Nonrecursive Queries , 1986, VLDB.

[18]  Eduardo Conde,et al.  On a constant factor approximation for minmax regret problems using a symmetry point scenario , 2012, Eur. J. Oper. Res..

[19]  Raghu Ramakrishnan,et al.  Database Management Systems , 1976 .

[20]  Yannis E. Ioannidis,et al.  The History of Histograms (abridged) , 2003, VLDB.

[21]  Peter J. Haas,et al.  Statistical Learning Techniques for Costing XML Queries , 2005, VLDB.

[22]  Adam Kasperski,et al.  Discrete Optimization with Interval Data - Minmax Regret and Fuzzy Approach , 2008, Studies in Fuzziness and Soft Computing.

[23]  A Gerodimos,et al.  Robust Discrete Optimization and its Applications , 1996, J. Oper. Res. Soc..

[24]  Surajit Chaudhuri,et al.  Towards a robust query optimizer: a principled and practical approach , 2005, SIGMOD '05.

[25]  Wei Hong,et al.  Exploiting correlated attributes in acquisitional query processing , 2005, 21st International Conference on Data Engineering (ICDE'05).

[26]  Jeffrey D. Ullman,et al.  NP-Complete Scheduling Problems , 1975, J. Comput. Syst. Sci..

[27]  Wolfgang Lehner,et al.  Cardinality estimation using sample views with quality assurance , 2007, SIGMOD '07.

[28]  S. Hansson Decision Theory a Brief Introduction Contents , 1994 .

[29]  Adam Kasperski,et al.  A 2-approximation algorithm for interval data minmax regret sequencing problems with the total flow time criterion , 2008, Oper. Res. Lett..

[30]  Roy Goldman,et al.  WSQ/DSQ: a practical approach for combined querying of databases and the Web , 2000, SIGMOD '00.

[31]  Joseph Y. Halpern,et al.  Least expected cost query optimization: an exercise in utility , 1999, PODS.

[32]  Roy Goldman,et al.  WSQ/DSQ: a practical approach for combined querying of databases and the Web , 2000, SIGMOD 2000.

[33]  E. F. Codd,et al.  A relational model of data for large shared data banks , 1970, CACM.

[34]  Guido Moerkotte,et al.  Constructing Optimal Bushy Processing Trees for Join Queries is NP-hard , 1996 .

[35]  Peter J. Haas,et al.  Consistent selectivity estimation via maximum entropy , 2007, The VLDB Journal.

[36]  Ali Allahverdi,et al.  Scheduling on M parallel machines subject to random breakdowns to minimize expected mean flow time , 1994 .

[37]  Igor Averbakh,et al.  Complexity of minimizing the total flow time with interval data and minmax regret criterion , 2006, Discret. Appl. Math..

[38]  Sven Helmer,et al.  Ordering Selection Operators Using the Minmax Regret Rule , 2015, ArXiv.

[39]  Martin Peterson,et al.  An Introduction to Decision Theory , 2009 .

[40]  Sumit Ganguly,et al.  Design and Analysis of Parametric Query Optimization Algorithms , 1998, VLDB.

[41]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[42]  Daniel Vanderpooten,et al.  Min-max and min-max regret versions of combinatorial optimization problems: A survey , 2009, Eur. J. Oper. Res..

[43]  Jan Karel Lenstra,et al.  Complexity of machine scheduling problems , 1975 .

[44]  Oren Etzioni,et al.  Efficient information gathering on the Internet , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[45]  Gregory Piatetsky-Shapiro,et al.  Accurate estimation of the number of tuples satisfying a condition , 1984, SIGMOD '84.

[46]  Didier Dubois,et al.  Decision-theoretic foundations of qualitative possibility theory , 2001, Eur. J. Oper. Res..

[47]  Jayant R. Haritsa,et al.  On the Production of Anorexic Plan Diagrams , 2007, VLDB.

[48]  Leonard J. Savage,et al.  The Theory of Statistical Decision , 1951 .

[49]  Igor Averbakh Minmax regret linear resource allocation problems , 2004, Oper. Res. Lett..

[50]  Luis Gravano,et al.  Selectivity estimation for string predicates: overcoming the underestimation problem , 2004, Proceedings. 20th International Conference on Data Engineering.

[51]  Toshihide Ibaraki,et al.  On the optimal nesting order for computing N-relational joins , 1984, TODS.

[52]  P. Muliere,et al.  Optimal sequential testing for an inverse Gaussian process , 2016 .

[53]  Daniel Vanderpooten,et al.  Approximating Min-Max (Regret) Versions of Some Polynomial Problems , 2006, COCOON.

[54]  Hamid Pirahesh,et al.  Robust query processing through progressive optimization , 2004, SIGMOD '04.

[55]  Amihai Motro,et al.  Management of uncertainty in database systems , 1995 .

[56]  Jennifer Widom,et al.  Operator placement for in-network stream query processing , 2005, PODS.

[57]  Doron Rotem,et al.  Simple Random Sampling from Relational Databases , 1986, VLDB.

[58]  Igor Averbakh Computing and minimizing the relative regret in combinatorial optimization with interval data , 2005, Discret. Optim..

[59]  Yuri N. Sotskov,et al.  Minimizing total weighted flow time under uncertainty using dominance and a stability box , 2012, Comput. Oper. Res..

[60]  Divesh Srivastava,et al.  Estimating the selectivity of approximate string queries , 2007, TODS.

[61]  David J. DeWitt,et al.  Proactive re-optimization , 2005, SIGMOD '05.

[62]  Jerzy Józefczyk,et al.  Solution algorithms for unrelated machines minmax regret scheduling problem with interval processing times and the total flow time criterion , 2014, Ann. Oper. Res..

[63]  Norman May,et al.  Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA , 2014, SIGMOD Conference.

[64]  Jayant R. Haritsa,et al.  Plan Bouquets , 2016, ACM Trans. Database Syst..

[65]  Ravi Sethi,et al.  The Complexity of Flowshop and Jobshop Scheduling , 1976, Math. Oper. Res..

[66]  Adam Kasperski,et al.  An approximation algorithm for interval data minmax regret combinatorial optimization problems , 2006, Inf. Process. Lett..

[67]  Yannis E. Ioannidis,et al.  Left-deep vs. bushy trees: an analysis of strategy spaces and its implications for query optimization , 1991, SIGMOD '91.

[68]  Panagiotis Kouvelis,et al.  Robust scheduling to hedge against processing time uncertainty in single-stage production , 1995 .

[69]  Sven Helmer,et al.  Ordering Selection Operators Under Partial Ignorance , 2015, CIKM.

[70]  Xiaotie Deng,et al.  Competitive Analysis of Network Load Balancing , 1997, J. Parallel Distributed Comput..

[71]  Igor Averbakh,et al.  On the complexity of a class of combinatorial optimization problems with uncertainty , 2001, Math. Program..

[72]  Ting-Yu Chen,et al.  Optimism and Pessimism in Decision Making Based on Intuitionistic Fuzzy Sets , 2008 .

[73]  Jennifer Widom,et al.  The CQL continuous query language: semantic foundations and query execution , 2006, The VLDB Journal.

[74]  Teofilo F. GONZALEZ,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[75]  A. Volgenant,et al.  Improved polynomial algorithms for robust bottleneck problems with interval data , 2010, Comput. Oper. Res..

[76]  Max Crochemore,et al.  Algorithms and Theory of Computation Handbook , 2010 .

[77]  Joseph Y. Halpern,et al.  Least expected cost query optimization: what can we expect? , 2002, PODS.

[78]  Guy M. Lohman,et al.  Is query optimization a 'solved' problem? , 1989 .

[79]  Viktor Leis,et al.  How Good Are Query Optimizers, Really? , 2015, Proc. VLDB Endow..

[80]  Jeffrey F. Naughton,et al.  Uncertainty Aware Query Execution Time Prediction , 2014, Proc. VLDB Endow..

[81]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[82]  Harvey J. Greenberg,et al.  Digital Object Identifier (DOI) 10.1007/s10107-005-0689-x , 2005 .

[83]  Stratis Viglas,et al.  Write-limited sorts and joins for persistent memory , 2014, Proc. VLDB Endow..

[84]  Joseph M. Hellerstein,et al.  Eddies: continuously adaptive query processing , 2000, SIGMOD 2000.

[85]  A. Goldberg,et al.  A heuristic improvement of the Bellman-Ford algorithm , 1993 .

[86]  Nicolas Bruno,et al.  Advanced Join Strategies for Large-Scale Distributed Computation , 2014, Proc. VLDB Endow..

[87]  David Harel,et al.  Structure and complexity of relational queries , 1980, 21st Annual Symposium on Foundations of Computer Science (sfcs 1980).

[88]  Xuemin Lin,et al.  Selectivity Estimation on Streaming Spatio-Textual Data Using Local Correlations , 2014, Proc. VLDB Endow..

[89]  Ali Allahverdi,et al.  Scheduling on a two-machine flowshop subject to random breakdowns with a makespan objective function , 1995 .

[90]  Tilmann Rabl,et al.  Variations of the star schema benchmark to test the effects of data skew on query performance , 2013, ICPE '13.

[91]  Jennifer Widom,et al.  Database System Implementation , 2000 .

[92]  Wayne E. Smith Various optimizers for single‐stage production , 1956 .

[93]  J. S. Saini,et al.  Adaptive Query Processing , 2006 .

[94]  Stavros Christodoulakis,et al.  On the propagation of errors in the size of join results , 1991, SIGMOD '91.

[95]  Thomas Neumann,et al.  Taking the Edge off Cardinality Estimation Errors using Incremental Execution , 2013, BTW.

[96]  Michael Stonebraker,et al.  Predicate migration: optimizing queries with expensive predicates , 1992, SIGMOD Conference.

[97]  Abdelkader Hameurlain,et al.  Robust Query Optimization Methods With Respect to Estimation Errors: A Survey , 2015, SGMD.

[98]  Jan Karel Lenstra,et al.  Complexity of Scheduling under Precedence Constraints , 1978, Oper. Res..

[99]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.

[100]  Minos N. Garofalakis,et al.  Wavelet synopses with error guarantees , 2002, SIGMOD '02.

[101]  Frank Werner,et al.  The Stability Box in Interval Data for Minimizing the Sum of Weighted Completion Times , 2011, SIMULTECH.

[102]  Ramez Elmasri,et al.  Fundamentals of Database Systems , 1989 .

[103]  Guy M. Lohman,et al.  Measuring the Complexity of Join Enumeration in Query Optimization , 1990, VLDB.

[104]  Leonidas Fegaras,et al.  A New Heuristic for Optimizing Large Queries , 1998, DEXA.

[105]  Donald D. Chamberlin,et al.  Access Path Selection in a Relational Database Management System , 1989 .

[106]  Sven Helmer,et al.  On the optimal ordering of maps and selections under factorization , 2005, 21st International Conference on Data Engineering (ICDE'05).

[107]  Joseph B. Mazzola,et al.  Resource-Constrained Assignment Scheduling , 1986, Oper. Res..