Full-Text (Substring) Indexes in External Memory

Textual databases are among the most rapidly growing collections of data. Some of these collections contain a type of data that differs from classical numerical or textual data: long sequences of symbols that are not divided into well-separated small tokens (words). The most prominent among such collections are databases of biological sequences, which are growing today at an unprecedented rate. Launched in 2008, the "1000 Genomes Project" has the ultimate goal of collecting the sequences of an additional 1,500 human genomes, 500 each of European, African, and East Asian origin. This will produce an extensive catalog of human genetic variation. The raw sequences in this catalog alone would occupy about 5 terabytes.

Querying strings without well-separated tokens poses a different set of challenges. These are typically addressed by building full-text indexes, which provide effective structures for indexing all the substrings of the given strings. Since full-text indexes occupy more space than the raw data, it is often necessary to use disk space for their construction. Until recently, however, the construction of full-text indexes in secondary storage was considered impractical due to excessive I/O costs. Algorithms developed in the last decade have demonstrated that efficient external construction of full-text indexes is indeed possible.

This book is about the large-scale construction and usage of full-text indexes. We focus mainly on suffix trees, and show efficient algorithms that can convert suffix trees to other kinds of full-text indexes and vice versa. The book has four parts, which combine string searching theory with the reality of external memory constraints. The first part introduces the general concepts of full-text indexes and shows the relationships between them. The second part presents the first series of external-memory construction algorithms, which can handle the construction of full-text indexes for moderately large strings on the order of a few gigabytes. The third part presents algorithms that scale to very large strings. The final part examines queries that can be facilitated by disk-resident full-text indexes.

Table of Contents: Structures for Indexing Substrings / External Construction of Suffix Trees / Scaling Up: When the Input Exceeds the Main Memory / Queries for Disk-based Indexes / Conclusions and Open Problems
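As a small illustration of what "indexing all the substrings" means, the following sketch builds a suffix array (one of the full-text indexes related to suffix trees) and uses it to locate every occurrence of a pattern. It is a naive in-memory version written only for illustration, not one of the external-memory algorithms the book describes; the function names `build_suffix_array` and `find_occurrences` and the sample string are assumptions made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction: sorting materializes suffixes as string slices,
    which is fine for short inputs but not for genome-scale data.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text.

    Every occurrence of pattern is a prefix of some suffix, and those
    suffixes form one contiguous block in the suffix array, so two
    binary searches are enough to find the block.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    sequence = "GATTACATTAG"  # hypothetical sample sequence
    index = build_suffix_array(sequence)
    print(find_occurrences(sequence, index, "TTA"))
```

Because the pattern can start at any position, no tokenization into words is needed, which is what makes this kind of index suitable for biological sequences. The index holds one integer per character of the input, which also hints at why full-text indexes take more space than the raw data.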
