Cluster-and-conquer: hierarchical multi-metric query processing in large-scale database federations

The federated database architecture was introduced to maintain the autonomy of individual data sources while still accomplishing federated tasks for diverse applications, from traditional enterprises to computational sciences. We identify two challenging problems of query optimization in large-scale database federation systems. First, the run-time conditions of data sources have a profound effect on the performance of database federations, yet the distributed environment makes it prohibitively expensive for the optimizer to gather rapidly fluctuating run-time conditions from remote data sources. Second, large-scale database federation systems are often widely distributed and built on heterogeneous networks, so efficient use of network resources is increasingly important for query scheduling. In this paper, we propose to exploit the clustered hierarchical structure of database federations to solve these two problems. Our Cluster-and-Conquer strategy coordinates hierarchical clusters of data sources to optimize and process queries cooperatively. Within each cluster, where run-time conditions are accessible with relatively little delay, we employ an I/O-bound cost model; among clusters, we instead use a network-bound cost model to capture network heterogeneity and optimize query plans for efficient network utilization. An experimental study on a prototype database federation system with real-world network settings shows the effectiveness of our Cluster-and-Conquer strategy for scheduling data-intensive queries and demonstrates its performance benefits over existing state-of-the-art solutions.
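The two-level cost model described above can be illustrated with a minimal sketch. It is not the paper's actual formulation: the names `Source`, `Link`, `io_bound_cost`, `network_bound_cost`, and `step_cost`, the page size, the per-page I/O time, and the load-scaling formula are all assumptions chosen only to show how a cost function might switch between an I/O-bound estimate inside a cluster and a network-bound estimate across clusters.

```python
from dataclasses import dataclass

PAGE_BYTES = 8192  # assumed page size, not taken from the paper


@dataclass(frozen=True)
class Source:
    name: str
    cluster: str
    io_load: float  # assumed fraction of I/O capacity in use, 0 <= io_load < 1


@dataclass(frozen=True)
class Link:
    latency_ms: float    # assumed one-way latency between two clusters
    bytes_per_ms: float  # assumed sustained bandwidth between two clusters


def io_bound_cost(pages: int, source: Source, ms_per_page: float = 0.1) -> float:
    """Hypothetical intra-cluster cost: page I/O time inflated by current load."""
    return pages * ms_per_page / (1.0 - source.io_load)


def network_bound_cost(pages: int, link: Link) -> float:
    """Hypothetical inter-cluster cost: latency plus transfer time."""
    return link.latency_ms + pages * PAGE_BYTES / link.bytes_per_ms


def step_cost(pages: int, src: Source, dst: Source, links: dict) -> float:
    """Pick the cost model by whether the step stays inside one cluster."""
    if src.cluster == dst.cluster:
        # Run-time load of the receiving source is assumed to be cheaply observable.
        return io_bound_cost(pages, dst)
    return network_bound_cost(pages, links[(src.cluster, dst.cluster)])


if __name__ == "__main__":
    a = Source("a", "east", io_load=0.5)
    b = Source("b", "east", io_load=0.0)
    c = Source("c", "west", io_load=0.0)
    links = {("east", "west"): Link(latency_ms=50.0, bytes_per_ms=8192.0)}
    print(step_cost(100, a, b, links))  # same cluster: I/O-bound estimate
    print(step_cost(100, a, c, links))  # different clusters: network-bound estimate
```

In this sketch the optimizer only needs fresh load figures for sources in its own cluster, while cross-cluster steps depend on comparatively stable link properties, which mirrors the motivation given in the abstract.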
