SearchGen: a synthetic workload generator for scientific literature digital libraries and search engines

Because web applications are popular and heavily used, a good understanding of their workloads is important for improving the performance of search services. Existing work has typically focused on generic web workloads without emphasizing specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize the workloads generated by both robots and users. We identify the essential ingredients that contribute to these workloads. Among them, access intervals show high variance and thus cannot be predicted well with time-series models. In contrast, client visiting paths and semantics are captured well by probabilistic models and Zipf's law. Based on these findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workloads fit the logged traces well.
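
To make the idea of combining a probabilistic visiting-path model with Zipf's law concrete, the following is a minimal sketch in Python. It is not SearchGen's actual implementation: the page types, the transition probabilities, the query vocabulary, and the Zipf exponent are all hypothetical values chosen for illustration. Access intervals are deliberately left out, since the abstract reports that they are not predicted well by time-series models.

```python
import random


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: the item of rank r gets weight 1 / r**s."""
    return [1.0 / (r ** s) for r in range(1, n + 1)]


# Hypothetical first-order Markov model of a client's visiting path.
# States and probabilities are illustrative assumptions, not measured values.
TRANSITIONS = {
    "home": {"search": 0.8, "document": 0.1, "exit": 0.1},
    "search": {"search": 0.3, "document": 0.5, "exit": 0.2},
    "document": {"search": 0.3, "document": 0.3, "exit": 0.4},
}


def generate_session(rng, queries, query_weights, max_steps=50):
    """Generate one synthetic session as a list of (page_type, query) pairs."""
    state = "home"
    trace = []
    for _ in range(max_steps):
        if state == "search":
            # Query popularity follows a Zipf-like distribution over ranks.
            query = rng.choices(queries, weights=query_weights, k=1)[0]
            trace.append((state, query))
        else:
            trace.append((state, None))
        # Pick the next page type from the current state's transition row.
        row = TRANSITIONS[state]
        state = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        if state == "exit":
            break
    return trace


rng = random.Random(42)  # fixed seed so the sketch is reproducible
queries = [f"query_{i}" for i in range(1, 101)]  # hypothetical vocabulary, rank order
weights = zipf_weights(len(queries))
sessions = [generate_session(rng, queries, weights) for _ in range(1000)]
```

Each session starts at a hypothetical home page and walks the transition table until it reaches the exit state or the step limit. In a real generator, the transition table and the popularity ranking would be fitted to logged traces, ideally separately for robots and for users.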
