Workload analysis for scientific literature digital libraries

Workload studies of large-scale systems can help locate potential bottlenecks and improve performance. However, previous workload analyses of Web applications have typically focused on generic platforms, neglecting the characteristics unique to particular application domains. Different application domains have intrinsically heterogeneous characteristics, which directly affect system performance. In this study, we present an extensive analysis of the workload of scientific literature digital libraries, unveiling their temporal and user interest patterns. We collect and analyze logs of CiteSeer, a computer science literature digital library. We intentionally remove service details specific to CiteSeer, and we believe our analysis is applicable to other systems with similar characteristics. While many of our findings are consistent with previous Web workload analyses, we discover several characteristics unique to the workload of scientific literature digital libraries. Furthermore, we discuss how our findings can be used to improve system performance.
