Distributed query processing using partitioned inverted files

In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system adopts a network of workstations model and the client-server paradigm. The document collection is indexed with an inverted file. We adopt two distinct strategies of index partitioning in the distributed system, namely local index partitioning and global index partitioning. In both strategies, documents are ranked using the vector space model along with a document filtering technique for fast ranking. We evaluate and compare the impact of the two index partitioning strategies on query processing performance. Experimental results on retrieval efficiency show that, within our framework, the global index partitioning outperforms the local index partitioning.
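The difference between the two partitioning strategies can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. The function names, the whitespace tokenizer, and the round-robin assignment of documents and terms to servers are all assumptions made for illustration, and ranking and document filtering are left out.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to a posting dict {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenizer (assumption)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def local_partition(docs, n_servers):
    """Local index partitioning: split the documents across servers,
    then let each server index only its own documents."""
    shards = [{} for _ in range(n_servers)]
    for i, doc_id in enumerate(sorted(docs)):
        shards[i % n_servers][doc_id] = docs[doc_id]  # round-robin (assumption)
    return [build_inverted_file(shard) for shard in shards]


def global_partition(docs, n_servers):
    """Global index partitioning: index the whole collection once,
    then split the vocabulary (with full posting lists) across servers."""
    full_index = build_inverted_file(docs)
    parts = [{} for _ in range(n_servers)]
    for i, term in enumerate(sorted(full_index)):
        parts[i % n_servers][term] = full_index[term]  # round-robin (assumption)
    return parts


def servers_holding(partitions, term):
    """Return the ids of the servers that store postings for a term."""
    return [i for i, part in enumerate(partitions) if term in part]
```

With local partitioning, the postings of one term are typically scattered over several servers, so a query has to be sent to all of them and the partial rankings merged. With global partitioning, the complete posting list of a term lives on a single server, so a query only involves the servers that own its terms.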

[1]  Ricardo A. Baeza-Yates, et al. Distributed Query Processing Using Partitioned Inverted Files, 2001, SPIRE.