Frequency-Based Coverage Statistics Mining for Data Integration

Recent work in data integration has shown that statistics about the coverage and overlap of data sources are important for efficient query processing. Gathering and storing these statistics poses many challenges, not least controlling the volume of statistics learned. In this paper we describe a novel application of data mining technology to statistics gathering. Specifically, we introduce a statistics mining approach that efficiently discovers frequently accessed query classes, and learns and stores statistics only with respect to these classes. We describe the details of our method and present experimental results demonstrating its efficiency and effectiveness.
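
The idea of keeping statistics only for frequently accessed query classes can be illustrated with a minimal sketch. It is not the method of the paper, whose details are not given here. It assumes a hypothetical query log in which each entry carries a query class label, the number of answers each source returned, and the total number of distinct answers. The names `mine_coverage_stats`, `query_log`, and `min_support` are illustrative assumptions.

```python
from collections import Counter, defaultdict


def mine_coverage_stats(query_log, min_support):
    """Keep coverage statistics only for frequently accessed query classes.

    query_log: list of (query_class, {source: answers_from_source}, total_answers)
               tuples. This log format is an assumption made for illustration.
    min_support: minimum fraction of logged queries a class must account for.
    Returns {query_class: {source: coverage}} for the frequent classes only.
    """
    n_queries = len(query_log)
    if n_queries == 0:
        return {}

    # Step 1: find the query classes whose access frequency meets the threshold.
    class_counts = Counter(qc for qc, _, _ in query_log)
    frequent = {qc for qc, c in class_counts.items() if c / n_queries >= min_support}

    # Step 2: accumulate answer counts only for those frequent classes.
    answers = defaultdict(Counter)
    totals = Counter()
    for qc, per_source, total in query_log:
        if qc not in frequent:
            continue  # statistics for infrequent classes are never stored
        answers[qc].update(per_source)
        totals[qc] += total

    # Step 3: coverage of a source for a class = its answers / all answers.
    return {
        qc: {src: cnt / totals[qc] for src, cnt in answers[qc].items()}
        for qc in frequent
        if totals[qc] > 0
    }
```

Raising `min_support` in this sketch stores statistics for fewer classes, which is one simple way to trade statistics volume against how many queries the stored statistics apply to.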
