Paper information - Distributed parallel generation of indices for very large text databases

We propose a new algorithm for the parallel generation of suffix arrays for large text databases on high-bandwidth computer networks. Suffix arrays are index structures for full-text retrieval that support very powerful query languages. Our algorithm is based on a parallel indirect mergesort, which is more than a simple mergesort procedure, and is compared with a well-known sequential algorithm that is very efficient on a single machine. Although network-bound, the parallel version is, both theoretically and experimentally, a much better alternative than the sequential version, which is bound by disk I/O.
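To make the two ideas in the abstract concrete, here is a toy, in-memory Python sketch. It is not the paper's algorithm. It only illustrates what a suffix array is and what "indirect" merging means: the merge moves pointers (suffix start positions) while comparing the text they point to. The function names are hypothetical, the chunks stand in for separate machines, and slicing `text[i:]` copies each suffix, which a real index over a large database would avoid.

```python
import heapq


def build_suffix_array(text, start=0, end=None):
    """Return suffix start positions in [start, end), sorted by suffix text."""
    if end is None:
        end = len(text)
    return sorted(range(start, end), key=lambda i: text[i:])


def merge_suffix_arrays(text, parts):
    """Indirect merge: combine sorted pointer lists by comparing the
    suffixes they point to, not the pointer values themselves."""
    return list(heapq.merge(*parts, key=lambda i: text[i:]))


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for every position where pattern occurs."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "banana"
    # Each "machine" sorts the suffixes that start in its own chunk.
    parts = [build_suffix_array(text, 0, 3), build_suffix_array(text, 3, 6)]
    sa = merge_suffix_arrays(text, parts)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

Because the suffixes of one text are all distinct, merging the per-chunk pointer lists yields the same array as sorting all positions at once. In a distributed setting the comparison inside the merge is the expensive step, since the text behind a pointer may reside on another machine.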
