A frequency-based approach for mining coverage statistics in data integration

Query optimization in data integration requires source coverage and overlap statistics. Gathering and storing these statistics present many challenges, not the least of which is controlling the number of statistics learned. We introduce StatMiner, a novel statistics mining approach that automatically generates attribute value hierarchies, efficiently discovers frequently accessed query classes based on the learned hierarchies, and learns statistics only with respect to these classes. We describe the details of our method and present experimental results demonstrating the efficiency and effectiveness of our approach. Our experiments are conducted in the context of BibFinder, a publicly fielded bibliography mediator.
