Efficient distributed algorithms to build inverted files

We present three distributed algorithms to build global inverted files for very large text collections. The distributed environment we use is a high-bandwidth network of workstations with a shared-nothing memory organization. The text collection is assumed to be evenly distributed among the disks of the various workstations. Our algorithms assume that the total distributed main memory is considerably smaller than the inverted file to be generated. The inverted file is compressed to save memory and disk space and to reduce the time spent moving data to and from disk and across the network. We analyze our algorithms and discuss the tradeoffs among them. We show that, with 8 processors and 16 megabytes of RAM available in each processor, the advanced variants of our algorithms are able to invert a 100-gigabyte collection (the size of the very large TREC-7 collection) in roughly 8 hours. With 16 processors, this time drops to roughly 4 hours.
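To make the setting concrete, the following is a minimal single-process sketch of building a global inverted file over a collection partitioned across shared-nothing nodes. It is not one of the three algorithms summarized above, whose details the abstract does not give. Everything here is an assumption made for illustration: the function names (`local_invert`, `owner`, `build_global_index`, `lookup`), the term-to-node assignment by CRC32 hash, and the choice of gap encoding plus variable-byte coding as the compression scheme. Memory limits and real network transfer are not modeled; the "send" step is a plain in-memory append.

```python
from collections import defaultdict
import zlib


def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, continuation byte
            n >>= 7
        out.append(n | 0x80)       # high bit marks the last byte of a number
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def local_invert(docs):
    """Invert one node's documents: {doc_id: text} -> {term: sorted doc ids}."""
    lists = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            lists[term].add(doc_id)
    return {term: sorted(ids) for term, ids in lists.items()}


def owner(term, num_nodes):
    """Hypothetical term-to-node assignment (deterministic CRC32 hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_nodes


def build_global_index(partitions):
    """partitions: one {doc_id: text} dict per node, with globally unique ids."""
    num_nodes = len(partitions)
    # Phase 1: every node inverts its own local documents.
    local = [local_invert(docs) for docs in partitions]
    # Phase 2: each local list is "sent" to the node that owns its term.
    inbox = [defaultdict(list) for _ in range(num_nodes)]
    for lists in local:
        for term, ids in lists.items():
            inbox[owner(term, num_nodes)][term].extend(ids)
    # Phase 3: each owner merges what it received and stores compressed gaps.
    index = []
    for received in inbox:
        compressed = {}
        for term, ids in received.items():
            ids = sorted(set(ids))
            gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
            compressed[term] = vbyte_encode(gaps)
        index.append(compressed)
    return index


def lookup(index, term):
    """Decode the list of document ids for a term from the global index."""
    data = index[owner(term, len(index))].get(term)
    if data is None:
        return []
    ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        ids.append(total)
    return ids
```

For example, with `partitions = [{1: "a b", 2: "b c"}, {3: "a c c", 300: "a"}]`, calling `lookup(build_global_index(partitions), "a")` decodes the merged list for the term `a` across both nodes. Storing gaps rather than absolute document ids keeps most stored numbers small, which is what lets a byte-oriented code shrink the lists.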
