Decoupled query optimization for federated database systems

We study the problem of query optimization in federated relational database systems. The nature of federated databases explicitly decouples many aspects of the optimization process, often making it imperative for the optimizer to consult the underlying data sources during cost-based optimization. This not only increases the cost of optimization but also significantly changes the trade-offs involved. The dominant cost in the decoupled optimization process is the "cost of costing," which has traditionally been considered insignificant. The optimizer can afford only a few rounds of messages to the underlying data sources, so optimization techniques in this environment must be geared toward gathering all the required cost information with minimal communication. In this paper, we explore the design space for a query optimizer in this environment and demonstrate the need for decoupling various aspects of the optimization process. We present minimum-communication decoupled variants of various query optimization techniques and discuss the trade-offs in their performance in this scenario. We have implemented these techniques in the Cohera federated database system, and our experimental results, somewhat surprisingly, indicate that a simple two-phase optimization scheme performs fairly well as long as the physical database design is known to the optimizer, though more aggressive algorithms are required otherwise.
