Analyzing imbalance among homogeneous index servers in a web search system

The performance of parallel query processing in a cluster of index servers is crucial for modern web search systems. In such a scenario, the response time is determined by the execution time of the slowest server to generate its partial ranked answer. Previous approaches investigate performance issues in this context using simulation, analytical modeling, experimentation, or a combination of them. Nevertheless, these approaches simply assume balanced execution times among homogeneous servers (for instance, by uniformly distributing the document collection among them), a scenario that we did not observe in our experiments. On the contrary, we found that even with a balanced distribution of the document collection among index servers, correlations between the frequency of a term in the query log and the size of its corresponding inverted list lead to imbalances in query execution times at these same servers, because these correlations affect disk caching behavior. Further, the size of the main memory at each server relative to its disk space usage and the number of servers participating in the parallel query processing also affect the imbalance of local query execution times. These are relevant findings that have not been reported before and that, we believe, are of interest to the research community.
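To make the dependence on the slowest server concrete, the following minimal sketch computes the response time of one query from per-server local execution times. The function names, the imbalance metric (slowest time divided by mean time), and the sample timings are illustrative assumptions, not definitions or measurements taken from the paper; network and broker merging costs are ignored.

```python
from statistics import mean


def response_time(local_times):
    """Query response time, assuming the broker waits for every server.

    The query is only as fast as the slowest local execution.
    """
    return max(local_times)


def imbalance(local_times):
    """Assumed imbalance metric: slowest local time over mean local time.

    A value of 1.0 means all servers finished at the same time.
    """
    return max(local_times) / mean(local_times)


# Hypothetical local execution times in milliseconds for four servers.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [8.0, 9.0, 10.0, 21.0]  # e.g. one server misses its disk cache

for name, times in (("balanced", balanced), ("skewed", skewed)):
    print(name, response_time(times), imbalance(times))
```

In the skewed case, three of the four servers finish in 10 ms or less, yet the query still takes as long as the slowest server, which is why imbalance among otherwise homogeneous servers matters.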
