Joint Use of Multiple Learned Statistics for Improving Online Source Selection

The autonomous and decentralized nature of available online sources prevents most existing integration systems from supporting flexible query processing that takes into account conflicting user objectives such as coverage, cost, and data quality. To achieve multi-objective query processing, a data integration system must be able to determine which sources are most relevant for a particular query, given the desired objectives. To do so, it must gather and use source-specific statistics. In this paper we present an approach that automatically gathers coverage and overlap statistics as well as response time statistics, and jointly uses these statistics to select relevant sources. We describe our approach and present experimental results, obtained in the context of BibFinder, that demonstrate its efficiency and effectiveness.
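To make the idea of jointly using coverage and response time statistics concrete, the following is a minimal sketch of one possible way to combine them, not the method of the paper. It assumes each source already has an estimated coverage and response time for the query, and it ranks sources by a weighted sum of the two. The names `SourceStats`, `rank_sources`, and `weight`, the normalization of response time, and the example sources are all hypothetical, and overlap statistics are ignored here for brevity.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    name: str
    coverage: float       # assumed estimate: fraction of answers returned, in [0, 1]
    response_time: float  # assumed estimate: response time in seconds


def rank_sources(stats, weight, k):
    """Return the names of the top-k sources by a weighted score.

    weight = 1.0 ranks by coverage only; weight = 0.0 ranks by speed only.
    This linear combination is an illustrative assumption.
    """
    if not stats:
        return []
    slowest = max(s.response_time for s in stats)

    def score(s):
        # Map response time to [0, 1], where 1.0 would be instantaneous
        # and 0.0 is the slowest source in the candidate set.
        speed = 1.0 - s.response_time / slowest if slowest > 0 else 1.0
        return weight * s.coverage + (1.0 - weight) * speed

    ranked = sorted(stats, key=score, reverse=True)
    return [s.name for s in ranked[:k]]


# Hypothetical sources, for illustration only.
sources = [
    SourceStats("A", coverage=0.9, response_time=10.0),
    SourceStats("B", coverage=0.5, response_time=1.0),
    SourceStats("C", coverage=0.2, response_time=2.0),
]

coverage_first = rank_sources(sources, weight=1.0, k=3)
speed_first = rank_sources(sources, weight=0.0, k=3)
```

With these made-up numbers, a coverage-only weighting prefers the slow but comprehensive source A, while a speed-only weighting prefers B. Changing `weight` is one simple way to let a user trade one objective against the other.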
