Block addressing indices for approximate text retrieval

Although approximate text retrieval has been gaining importance in recent years, it is currently addressed by only a few indexing schemes. To reduce space requirements, an index may point to text blocks instead of exact word positions. This is called "block addressing". The best-known index of this kind is Glimpse. However, block addressing has not yet been well studied, especially with regard to approximate searching. Our main contribution is an analytical study of the space-time trade-offs related to the block size. We find that, under reasonable assumptions, it is possible to build an index that is simultaneously sublinear in space overhead and in query time. We validate the analysis with extensive experiments, obtaining typical performance figures. These results are valid not only for approximate searching queries but also for classical ones. Finally, we propose a new strategy for approximate searching on block addressing indices, which we experimentally find to be 4-5 times faster than Glimpse. This algorithm takes advantage of the index even if the whole text has to be scanned. As a side effect, we find that using blocks of fixed size is better than, say, addressing files.
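
As an illustration of the general idea only, the following is a minimal sketch of a block addressing index with approximate search: the vocabulary is scanned for words within edit distance k of the pattern, and only the blocks that contain such words are then scanned sequentially to locate the exact positions. It is not the new strategy proposed here, nor the Glimpse implementation. The function names, the character-based fixed block size, and the rule that a word belongs to the block where it starts are all assumptions made for the example.

```python
import re
import string
from collections import defaultdict

WORD = re.compile(r"[a-z]+")  # assumption: words are runs of ASCII letters


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def build_index(text, block_size):
    """Map each word to the set of block numbers where it starts.

    Only block numbers are stored, not exact positions, which is what
    keeps the index small.
    """
    index = defaultdict(set)
    for m in WORD.finditer(text.lower()):
        index[m.group()].add(m.start() // block_size)
    return index


def approx_search(text, index, block_size, pattern, k):
    """Return (position, word) pairs for words within k errors of pattern."""
    # Step 1: approximate search on the vocabulary only.
    matching = {w for w in index if edit_distance(w, pattern) <= k}
    # Step 2: collect the candidate blocks.
    blocks = sorted({b for w in matching for b in index[w]})
    # Step 3: scan each candidate block to recover exact positions.
    lowered = text.lower()
    hits = []
    for b in blocks:
        start, end = b * block_size, (b + 1) * block_size
        for m in WORD.finditer(lowered, start):
            if m.start() >= end:
                break  # the word starts in a later block
            if (m.start() == start and start > 0
                    and lowered[start - 1] in string.ascii_lowercase):
                continue  # tail of a word that started in the previous block
            if m.group() in matching:
                hits.append((m.start(), m.group()))
    return hits
```

A larger `block_size` shrinks the index, since each word then points to fewer distinct blocks, but it also increases the amount of text scanned in step 3. This is the space-time trade-off analysed here.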
