Modeling performance-driven workload characterization of web search systems

In this paper, we model workloads for a web search system from a performance point of view. We analyze both workload intensity and service demand, which in the context of web search systems are expressed as the distribution of query interarrival times and the per-query execution time, respectively. Our results are derived from experiments on an information retrieval testbed fed with real-world data. Our findings reveal several unexpected features. We verify in practice that there is high variability both in the interarrival times of queries reaching a search engine and in the service times of queries processed in parallel by a cluster of index servers. We also show that this highly variable behavior can be accurately captured by hyperexponential distributions. These results shed light on the usual assumption, made by previous analytical models for web search systems in the literature, that interarrival times and service times are exponentially distributed. We also find evidence that the intensity and service demand workloads of a typical web search system exhibit long-range dependence, leading to self-similar behavior. This finding is important because, in the presence of long-range dependence and self-similarity, exponential-based models tend to underestimate response times: self-similarity increases queueing delays, which results in significant performance degradation. Based on our findings, we also discuss possible steps toward a generative model for synthetic workloads.
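
To make the contrast between exponential and hyperexponential behavior concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the fitting procedure or the data used in the paper: the two-phase parameters (`probs`, `rates`), the arrival rate, and all function names are hypothetical. The sketch computes the squared coefficient of variation of a hyperexponential distribution (always 1 for an exponential) and plugs it into the textbook Pollaczek-Khinchine formula for an M/G/1 queue. That formula assumes Poisson arrivals and captures only the effect of service-time variance; it does not model long-range dependence or self-similarity.

```python
import random
import statistics


def sample_hyperexp(probs, rates, n, rng):
    """Draw n samples: pick phase i with probability probs[i], then Exp(rates[i])."""
    phases = rng.choices(range(len(rates)), weights=probs, k=n)
    return [rng.expovariate(rates[i]) for i in phases]


def hyperexp_moments(probs, rates):
    """Return the exact first and second moments of a hyperexponential."""
    mean = sum(p / r for p, r in zip(probs, rates))
    second = sum(2.0 * p / r ** 2 for p, r in zip(probs, rates))
    return mean, second


def squared_cv(probs, rates):
    """Squared coefficient of variation: variance / mean^2."""
    mean, second = hyperexp_moments(probs, rates)
    return second / mean ** 2 - 1.0


def mg1_mean_wait(arrival_rate, mean_service, scv):
    """Pollaczek-Khinchine mean queueing delay for M/G/1 (Poisson arrivals assumed)."""
    rho = arrival_rate * mean_service
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho * mean_service * (1.0 + scv) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Hypothetical parameters: 90% short queries, 10% long queries.
    probs, rates = [0.9, 0.1], [2.0, 0.2]
    rng = random.Random(42)
    samples = sample_hyperexp(probs, rates, 200_000, rng)

    mean, _ = hyperexp_moments(probs, rates)
    scv = squared_cv(probs, rates)
    empirical_scv = statistics.pvariance(samples) / statistics.fmean(samples) ** 2
    print(f"exact mean={mean:.3f}  exact SCV={scv:.3f}  empirical SCV={empirical_scv:.3f}")

    # Same mean service time and arrival rate, different variability.
    lam = 0.5  # hypothetical arrival rate (queries per time unit)
    print(f"wait, exponential service (SCV=1): {mg1_mean_wait(lam, mean, 1.0):.3f}")
    print(f"wait, hyperexponential service:    {mg1_mean_wait(lam, mean, scv):.3f}")
```

With the same mean service time and arrival rate, the hyperexponential case yields a longer mean queueing delay than the exponential case because its squared coefficient of variation exceeds 1, which is the sense in which an exponential assumption can underestimate delay even before long-range dependence is considered.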
