Executing incoherency bounded continuous queries at web data aggregators

Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. Typically, a user wants the value of some function over distributed data items, for example, to determine when and whether (a) the traffic entering a highway from multiple feeder roads will result in congestion in a thoroughfare or (b) the value of a stock portfolio exceeds a threshold. Using the standard Web infrastructure for these applications will increase the reach of the underlying information. However, because these queries involve data from multiple sources, and the sources support only standard HTTP (pull-based) interfaces, special query processing techniques are needed. These applications can also often tolerate some incoherency, i.e., some difference between the results reported to the user and those that would be produced from the virtual database made up of the distributed data sources.

In this paper, we develop and evaluate client-pull-based techniques for refreshing data so that the results of queries over distributed data can be correctly reported, conforming to the limited incoherency acceptable to the users.

We model and estimate the dynamics of the data items using a probabilistic approach based on Markov chains. Depending on the dynamics of the data, we adapt the data refresh times to deliver query results with the desired coherency. The commonality of data needs across multiple queries is exploited to further reduce refresh overheads. The effectiveness of our approach is demonstrated using live sources of dynamic data: the number of refreshes it requires is (a) an order of magnitude less than what would be needed if every potential update were pulled from the sources, and (b) comparable to the number of messages needed by an ideal algorithm, one that knows how to optimally refresh the data from distributed data sources. Our evaluations also bring out a very practical and attractive tradeoff property of pull-based approaches: a small increase in tolerable incoherency leads to a large decrease in message overheads.
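
To make the idea of adapting refresh times to data dynamics concrete, the following Python sketch shows one possible reading of it. It is not the algorithm of this paper. It assumes the data item has been discretized into integer states, estimates a first-order Markov chain from a history of observations, and waits as many time steps as possible while the probability of drifting beyond the incoherency bound stays below a threshold. The names `estimate_transitions`, `step`, `next_refresh`, `bound`, `max_violation_prob`, and `horizon` are all hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def estimate_transitions(states):
    """Estimate first-order transition probabilities from an observed sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: c / total for b, c in row.items()}
    return probs


def step(dist, probs):
    """Advance a probability distribution over states by one time step."""
    nxt = defaultdict(float)
    for a, p in dist.items():
        # Assumption: a state with no observed outgoing transition stays put.
        row = probs.get(a, {a: 1.0})
        for b, q in row.items():
            nxt[b] += p * q
    return dict(nxt)


def next_refresh(current, probs, bound, max_violation_prob, horizon):
    """Return how many steps to wait before the next pull (at least 1).

    The wait is the last step k in 1..horizon, reached without interruption,
    at which P(|state - current| > bound) <= max_violation_prob.
    """
    dist = {current: 1.0}
    wait = 1  # never wait less than one step between pulls
    for k in range(1, horizon + 1):
        dist = step(dist, probs)
        violation = sum(p for s, p in dist.items() if abs(s - current) > bound)
        if violation > max_violation_prob:
            break
        wait = k
    return wait


# Usage: a steadily rising item tolerates a longer wait under a looser bound.
history = [0, 1, 2, 3, 4, 5]
model = estimate_transitions(history)
tight = next_refresh(0, model, bound=2, max_violation_prob=0.1, horizon=10)
loose = next_refresh(0, model, bound=4, max_violation_prob=0.1, horizon=10)
```

In this toy setting, loosening the bound lengthens the interval between pulls, which is the kind of tradeoff between tolerable incoherency and message overhead that the abstract describes. A real aggregator would also have to combine per-item bounds into a query-level bound, which this sketch does not attempt.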
