Learning response time for WebSources using query feedback and application in query optimization

Abstract. The rapid growth of the Internet and support for interoperability protocols have increased the number of Web-accessible sources (WebSources). Current wrapper-mediator architectures need to be extended with a wrapper cost model (WCM) for WebSources that can estimate the response time (delay) to access a source, as well as other relevant statistics. In this paper, we present the Web prediction tool (WebPT), a tool based on learning from query feedback from WebSources. The WebPT uses three dimensions, time of day, day, and quantity of data, to learn response times from a particular WebSource and to predict the expected response time (delay) for a query. Experimental data was collected from several sources, and the dimensions that were significant in estimating the response time were determined. We then trained the WebPT on the collected data to use the three dimensions mentioned above and to predict the response time together with a confidence in the prediction. We describe the WebPT learning algorithms and report on the WebPT learning for WebSources. Our research shows that the quality of learning can be improved by tuning the WebPT features, e.g., training the WebPT on a logarithm of the input training data, including significant dimensions in the WebPT, or changing the ordering of dimensions. We have also compared the WebPT with more traditional neural network (NN) learning, and we briefly report on the comparison. We then demonstrate how the WebPT prediction of delay may be used by a scrambling-enabled optimizer. A scrambling algorithm identifies critical points of delay at which it decides whether to scramble (modify) a plan, attempting to hide the expected delay by computing some other part of the plan that is unaffected by the delay. We explore the space of real delay at a WebSource versus the WebPT prediction of this delay, with respect to critical points of delay in specific plans. We identify the cases where WebPT overestimation or underestimation of the real delay results in a penalty in the scrambling-enabled optimizer, and the cases where there is no penalty. Using the experimental data and WebPT learning, we test how well the WebPT minimizes these penalties.
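
The relationship between predicted delay, real delay, and a critical point can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes a single critical point per plan, assumes the optimizer scrambles exactly when the predicted delay exceeds that point, and assumes a penalty arises only when the prediction and the real delay fall on opposite sides of it. The names `Outcome` and `classify` and the penalty labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    scrambled: bool        # decision taken from the predicted delay
    should_scramble: bool  # decision the real delay would have justified
    kind: str              # "no penalty", "needless scramble", or "missed scramble"


def classify(predicted_delay: float, real_delay: float, critical_point: float) -> Outcome:
    """Compare the decision driven by the prediction with the one the real delay warrants.

    Assumption (not from the source): scrambling is triggered when a delay
    strictly exceeds the critical point.
    """
    scrambled = predicted_delay > critical_point
    should_scramble = real_delay > critical_point
    if scrambled == should_scramble:
        # Over- or underestimation that stays on the same side of the
        # critical point leads to the same decision.
        kind = "no penalty"
    elif scrambled:
        # Overestimation across the critical point: the plan was modified
        # although the real delay did not call for it.
        kind = "needless scramble"
    else:
        # Underestimation across the critical point: the delay was not hidden.
        kind = "missed scramble"
    return Outcome(scrambled, should_scramble, kind)


if __name__ == "__main__":
    # Delays in seconds; the values are made up for illustration.
    for predicted, real in [(12.0, 20.0), (12.0, 8.0), (6.0, 15.0)]:
        print(predicted, real, classify(predicted, real, critical_point=10.0).kind)
```

Under these assumptions, an underestimate of 12 s for a real delay of 20 s costs nothing when the critical point is 10 s, because both values lead to the same scrambling decision. Only errors that cross the critical point change the plan.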
