Evaluating the performance of distributed architectures for information retrieval using a variety of workloads

The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
