Document replication strategies for geographically distributed web search engines

Large-scale web search engines are composed of multiple data centers that are geographically distant from one another. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times because the network latencies between users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes, because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on the index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.
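
To make the idea of selective replication under a capacity constraint concrete, the following is a minimal illustrative sketch, not one of the strategies proposed in the paper. It assumes a hypothetical log of (region, document) interactions and a fixed per-data-center capacity counted in documents, and it greedily keeps the documents that each region requests most often. The function names, the log format, and the hit-rate measure are all assumptions made for illustration.

```python
from collections import Counter


def replicate_by_regional_interest(interaction_log, capacity):
    """Greedy sketch: for each region, keep the `capacity` most requested documents.

    interaction_log: iterable of (region, doc_id) pairs (hypothetical format).
    capacity: maximum number of documents a regional data center may index.
    """
    interest = {}
    for region, doc_id in interaction_log:
        interest.setdefault(region, Counter())[doc_id] += 1

    placement = {}
    for region, counts in interest.items():
        # Rank by descending request count; break ties by doc_id for determinism.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        placement[region] = {doc_id for doc_id, _ in ranked[:capacity]}
    return placement


def local_hit_rate(interaction_log, placement):
    """Fraction of logged requests whose document is replicated in the local region."""
    hits = 0
    total = 0
    for region, doc_id in interaction_log:
        total += 1
        if doc_id in placement.get(region, set()):
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    log = [("eu", "a"), ("eu", "a"), ("eu", "b"),
           ("us", "c"), ("us", "b"), ("us", "c")]
    placement = replicate_by_regional_interest(log, capacity=1)
    print(placement)
    print(local_hit_rate(log, placement))
```

In this toy setting, requests that miss the local replica would have to be handled elsewhere (for example, by forwarding the query to another data center), which is the kind of trade-off between index size, quality, and response time that the abstract describes.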
