Optimizing Recursive Information Gathering Plans in EMERAC

In this paper we describe two optimization techniques that are specially tailored for information gathering. The first is a greedy minimization algorithm that minimizes an information gathering plan by removing redundant and overlapping information sources without loss of completeness. We then discuss a set of heuristics that guide the greedy minimization algorithm so as to remove costlier information sources first. In contrast to previous work, our approach can handle the recursive query plans that commonly arise in the presence of constrained sources. Second, we present a method for ordering the access to sources to reduce the execution cost. This problem differs significantly from the traditional database query optimization problem, because sources on the Internet have a variety of access limitations and the execution cost in information gathering is affected both by network traffic and by connection setup costs. Furthermore, because of the autonomous and decentralized nature of the Web, very few cost statistics about the sources may be available. We propose a heuristic algorithm for ordering source calls that takes these constraints into account: it considers both access costs and traffic costs, and it can operate with very coarse statistics about sources (i.e., without depending on full source statistics). Finally, we discuss the implementation and empirical evaluation of these methods in Emerac, our prototype information gathering system.
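To make the source-ordering idea concrete, the following is a minimal illustrative sketch, not the algorithm implemented in Emerac. It assumes each source declares which variables must already be bound before it can be called (its access limitation), and that only a coarse per-source estimate of transferred tuples is known. The names `Source`, `order_calls`, `setup_cost`, `est_tuples`, and `per_tuple_cost` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one information source."""
    name: str
    needs: frozenset   # variables that must be bound before the call
    gives: frozenset   # variables that become bound after the call
    setup_cost: float  # assumed cost of opening a connection
    est_tuples: float  # coarse guess at the number of tuples transferred


def order_calls(sources, bound, per_tuple_cost=1.0):
    """Greedily order source calls.

    At each step, consider only sources whose required inputs are
    already bound, and pick the one with the lowest estimated cost
    (connection setup plus traffic).
    """
    bound = set(bound)
    remaining = list(sources)
    plan = []
    while remaining:
        feasible = [s for s in remaining if s.needs <= bound]
        if not feasible:
            raise ValueError("no remaining source can be called")
        best = min(
            feasible,
            key=lambda s: s.setup_cost + per_tuple_cost * s.est_tuples,
        )
        plan.append(best.name)
        bound |= best.gives
        remaining.remove(best)
    return plan


example_sources = [
    Source("A", frozenset(), frozenset({"x"}), 5.0, 100.0),
    Source("B", frozenset({"x"}), frozenset({"y"}), 1.0, 10.0),
    Source("C", frozenset(), frozenset({"z"}), 2.0, 20.0),
]
example_plan = order_calls(example_sources, bound=set())
```

In this example, source B is the cheapest call but cannot go first because it needs `x`, which only A supplies. The sketch therefore schedules B after A even though A is the most expensive call. A real optimizer would also have to account for how earlier calls change the number and size of later ones, which this sketch ignores.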
