Searching a Terabyte of Text Using Partial Replication

Abstract: The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we investigate using partial replication to search a terabyte of text in our distributed IR system. We use a replica selection database to direct queries to relevant replicas, which maintains query effectiveness while restricting some searches to a small percentage of the data to improve performance and scalability and to reduce network latency. Using a validated simulator, we compare database partitioning to partial replication with load balancing, and find that partial replication is much more effective at decreasing query response time, even with fewer resources, and that it requires only modest query locality. We also demonstrate average query response times under 10 seconds for a variety of workloads with partial replication on a terabyte text database. We further investigate query locality with respect to time, replica size, and replica updating costs using real logs from THOMAS and Excite, and discuss the sensitivity of our results to these sample points.
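The routing idea in the abstract, sending a query to a small partial replica when that replica is likely to answer it well and otherwise to the full collection, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Replica` class, the `overlap_score` and `select_replica` functions, the term-overlap score, and the 0.75 threshold are hypothetical stand-ins, not the replica selection database or ranking method used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical replica description; not the paper's data structure."""
    name: str
    terms: frozenset       # assumed summary of the vocabulary the replica covers
    is_full: bool = False  # True for the complete collection


def overlap_score(query_terms, replica):
    """Fraction of query terms covered by the replica's term summary."""
    if not query_terms:
        return 0.0
    return len(query_terms & replica.terms) / len(query_terms)


def select_replica(query, replicas, threshold=0.75):
    """Return the best partial replica scoring at or above the threshold,
    falling back to the full collection when none qualifies."""
    query_terms = frozenset(query.lower().split())
    full = next(r for r in replicas if r.is_full)
    best, best_score = full, 0.0
    for replica in replicas:
        if replica.is_full:
            continue
        score = overlap_score(query_terms, replica)
        if score >= threshold and score > best_score:
            best, best_score = replica, score
    return best


replicas = [
    Replica("full", frozenset(), is_full=True),
    Replica("hot", frozenset({"tax", "reform", "bill"})),
]
print(select_replica("tax reform", replicas).name)
print(select_replica("education funding", replicas).name)
```

In this sketch a query whose terms are all covered by the small replica is served there, and any other query goes to the full collection, so effectiveness is traded against the amount of data searched through a single threshold.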
