String algorithms and data structures

The string-matching field has grown to such a level of complexity that many issues come into play when studying it: data structure and algorithmic design, database principles, compression techniques, architectural features, and cache and prefetching policies. The expertise now required to design good string data structures and algorithms therefore cuts across many computer science fields, and much more study of how to orchestrate known or novel techniques is needed to make progress on this fascinating topic. This survey aims to illustrate the key ideas that should constitute, in our opinion, the current background of every index designer. We also discuss the positive features and drawbacks of known indexing schemes and algorithms, and devote much attention to detailing research issues and open problems on both the theoretical and the experimental side.
