Compressed Full-Text Indexes for Highly Repetitive Collections

This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide functionality similar to that of the uncompressed indexes, while using only slightly more space than the compressed data. The efficiency of data compression is usually measured in terms of entropy.
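To make the notion of a full-text index concrete, the following is a minimal sketch of an uncompressed suffix array with binary search. It is an illustration only, not one of the compressed indexes studied in the thesis. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is chosen for brevity rather than efficiency.

```python
def build_suffix_array(text):
    # Naive construction (an assumption made for brevity): sort the start
    # positions of all suffixes by the suffixes themselves.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Every occurrence of the pattern is a prefix of some suffix, and those
    # suffixes form one contiguous range of the suffix array.
    m = len(pattern)

    # Binary search for the first suffix whose m-character prefix >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Binary search for the first suffix whose m-character prefix > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Report the text positions in increasing order.
    return sorted(sa[start:lo])


text = "abracadabra"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "abra"))
```

In this sketch the suffix array stores one integer per text position, which is typically several times the size of the text itself. This is the space overhead that compressed indexes aim to avoid.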
