Joint optimization of cost and coverage of query plans in data integration

Existing approaches for optimizing queries in data integration use decoupled strategies, attempting to optimize coverage and cost in two separate phases. Because sources tend to have a variety of access limitations, such phased optimization can lead to expensive planning as well as highly inefficient plans. In this paper, we present techniques for joint optimization of the cost and coverage of query plans. Our algorithms search in the space of parallel query plans that support multiple sources for each subgoal conjunct. The refinement of partial plans takes into account the potential parallelism between source calls and the binding compatibilities between the sources included in the plan. We start by introducing and motivating our query plan representation. We then briefly review how to compute the cost and coverage of a parallel plan. Next, we provide both a System-R style query optimization algorithm and a greedy local search algorithm for searching in the space of such query plans. Finally, we present a simulation study demonstrating that the plans generated by our approach are significantly better than those of existing approaches, in terms of both planning cost and plan execution cost.
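To make the idea of a parallel plan concrete, the following is a minimal Python sketch of a greedy search over plans that allow several sources per subgoal. It is a toy under stated assumptions, not the cost model or algorithm of the paper: the `Source` fields, the independence assumption used to combine coverages, the max/sum cost rule, the linear `utility` with weight `w`, and all numbers are hypothetical, and binding compatibilities between sources are not modeled.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Source:
    name: str
    coverage: float  # assumed fraction of a subgoal's answers this source returns
    cost: float      # assumed response time of one call to this source


def plan_coverage(plan):
    # Assumption: sources for one subgoal overlap independently,
    # and the coverages of the subgoals multiply across the join.
    return prod(1 - prod(1 - s.coverage for s in srcs) for srcs in plan)


def plan_cost(plan):
    # Assumption: calls for one subgoal run in parallel (max),
    # while the subgoals themselves are processed in sequence (sum).
    return sum(max(s.cost for s in srcs) for srcs in plan)


def utility(plan, w=0.5):
    # Hypothetical joint objective: reward coverage, penalize cost.
    return w * plan_coverage(plan) - (1 - w) * plan_cost(plan)


def greedy_plan(candidates, w=0.5):
    # Start from the single best source for each subgoal.
    plan = [[max(srcs, key=lambda s: utility([[s]], w))] for srcs in candidates]
    while True:
        best, best_u = None, utility(plan, w)
        # Try adding each unused source to its subgoal; keep the best improvement.
        for i, srcs in enumerate(candidates):
            for s in srcs:
                if s in plan[i]:
                    continue
                trial = [p + [s] if j == i else p for j, p in enumerate(plan)]
                u = utility(trial, w)
                if u > best_u:
                    best, best_u = trial, u
        if best is None:  # no single addition improves the utility
            return plan
        plan = best


# Two subgoals, each with two hypothetical candidate sources.
candidates = [
    [Source("A", 0.6, 0.1), Source("B", 0.5, 0.1)],
    [Source("C", 0.9, 0.2), Source("D", 0.1, 5.0)],
]
plan = greedy_plan(candidates)
print([[s.name for s in srcs] for srcs in plan])
```

In this toy setting, adding a second cheap source to the first subgoal raises coverage without raising the parallel cost, while the slow source `D` is left out because its small coverage gain does not offset its cost. A decoupled approach that fixed coverage first and cost second would not see that trade-off in a single step.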
