Effectively mining and using coverage and overlap statistics for data integration

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping their number low enough that storage and learning costs remain manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the number of needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics, both over controlled data sources and in the context of BibFinder with autonomous online sources.
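To make the two kinds of statistics concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical query log in which each entry pairs a query class with the answer tuples each source returned. Coverage is taken here as the fraction of all distinct answers that a source supplies, and overlap as the fraction supplied by both sources of a pair. The names `learn_statistics`, `summarize`, and `min_support`, and the single root class `"*"` standing in for a full class hierarchy, are illustrative assumptions.

```python
from collections import defaultdict


def summarize(entries):
    """Estimate coverage and pairwise overlap from a list of per-query answers.

    Each entry maps a source name to the set of answer ids it returned.
    """
    all_tuples = 0
    covered = defaultdict(int)
    shared = defaultdict(int)
    for answers in entries:
        # Distinct answers for this query across all sources.
        union = set().union(*answers.values())
        all_tuples += len(union)
        sources = sorted(answers)
        for i, first in enumerate(sources):
            covered[first] += len(answers[first])
            for second in sources[i + 1:]:
                shared[(first, second)] += len(answers[first] & answers[second])
    if all_tuples == 0:
        return {"coverage": {}, "overlap": {}}
    return {
        "coverage": {s: n / all_tuples for s, n in covered.items()},
        "overlap": {pair: n / all_tuples for pair, n in shared.items()},
    }


def learn_statistics(query_log, min_support=0.25):
    """Keep per-class statistics only for sufficiently frequent query classes.

    query_log is a list of (query_class, {source: set_of_answer_ids}) pairs.
    Classes below min_support are not stored; they fall back to the root "*".
    """
    by_class = defaultdict(list)
    for query_class, answers in query_log:
        by_class[query_class].append(answers)

    stats = {"*": summarize([answers for _, answers in query_log])}
    for query_class, entries in by_class.items():
        if len(entries) / len(query_log) >= min_support:
            stats[query_class] = summarize(entries)
    return stats


def lookup(stats, query_class):
    """Return the statistics for a class, or the root statistics if pruned."""
    return stats.get(query_class, stats["*"])
```

The frequency threshold is what keeps the number of stored statistics bounded in this sketch: rare query classes share the coarser root-level estimate instead of getting their own entry, which trades some accuracy for lower storage and learning cost.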
