Trends in Su x Sorting: A Survey of Low Memory Algorithms

The suffix array is a sorted array of all the suffixes in a string. This remarkably simple data structure is fundamental for string processing and lies at the heart of efficient algorithms for pattern matching, pattern mining, and data compression. In many applications suffix array construction, or equivalently suffix sorting, is a computational bottleneck and so has been the focus of intense research in the last 20 years. This paper outlines several suffix array construction algorithms that have emerged since the survey due to Puglisi, Smyth and Turpin [ACM Computing Surveys 39, 2007]. These algorithms have tended to strive for small working space (RAM), often at the cost of runtime, and make use of compressed data structures or secondary memory (disk) to achieve this goal. We provide a high-level description of each algorithm, avoiding implementation details as much as possible, and outline directions that could benefit from further research.

[1]  Giovanni Manzini,et al.  An analysis of the Burrows-Wheeler transform , 2001, SODA '99.

[2]  Jouni Sirén,et al.  Compressed Suffix Arrays for Massive Data , 2009, SPIRE.

[3]  Joong Chae Na Linear-Time Construction of Compressed Suffix Arrays Using o(n log n)-Bit Working Space for Large Alphabets , 2005, CPM.

[4]  Giovanna Rosone,et al.  Lightweight BWT Construction for Very Large String Collections , 2011, CPM.

[5]  Srinivas Aluru,et al.  Space efficient linear time construction of suffix arrays , 2003, J. Discrete Algorithms.

[6]  Kunihiko Sadakane,et al.  A Linear-Time Burrows-Wheeler Transform Using Induced Sorting , 2009, SPIRE.

[7]  Gianni Franceschini,et al.  In-Place Suffix Sorting , 2007, ICALP.

[8]  Paul Flicek,et al.  Sense from sequence reads: methods for alignment and assembly , 2009, Nature Methods.

[9]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[10]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[11]  Gianni Franceschini,et al.  Optimal suffix selection , 2007, STOC '07.

[12]  Meng He,et al.  Indexing Compressed Text , 2003 .

[13]  Roberto Grossi,et al.  Optimal cache-aware suffix selection , 2009, STACS.

[14]  Wing-Kai Hon,et al.  Breaking a Time-and-Space Barrier in Constructing Full-Text Indices , 2009, SIAM J. Comput..

[15]  Jasbir Dhaliwal Faster semi-external suffix sorting , 2014, Inf. Process. Lett..

[16]  Giovanni Manzini,et al.  Engineering a Lightweight Suffix Array Construction Algorithm , 2002, ESA.

[17]  Gang Chen,et al.  Lempel–Ziv Factorization Using Less Time & Space , 2008, Math. Comput. Sci..

[18]  Travis Gagie,et al.  Lightweight Data Indexing and Compression in External Memory , 2009, Algorithmica.

[19]  William F. Smyth,et al.  A taxonomy of suffix array construction algorithms , 2007, CSUR.

[20]  Juha Kärkkäinen,et al.  Fast Lightweight Suffix Array Construction and Checking , 2003, CPM.

[21]  J. Shane Culpepper,et al.  Top-k Ranked Document Search in General Text Databases , 2010, ESA.

[22]  Ge Nong,et al.  Linear Suffix Array Construction by Almost Pure Induced-Sorting , 2009, 2009 Data Compression Conference.

[23]  Gianni Franceschini,et al.  Finding the Maximum Suffix with Fewer Comparisons , 2010, CIAC.

[24]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[25]  Giovanni Manzini,et al.  On compressing the textual web , 2010, WSDM '10.

[26]  Wing-Kai Hon,et al.  Inverted indexes for phrases and strings , 2011, SIGIR.

[27]  Enno Ohlebusch,et al.  Replacing suffix trees with enhanced suffix arrays , 2004, J. Discrete Algorithms.

[28]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[29]  Juha Kärkkäinen,et al.  Fast BWT in small space by blockwise suffix sorting , 2007, Theor. Comput. Sci..

[30]  Robert Sedgewick,et al.  Fast algorithms for sorting and searching strings , 1997, SODA '97.

[31]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.