On Context-Diverse Repeats and Their Incremental Computation

The context in which a substring appears is important for identifying, for example, its semantic meaning. However, existing classes of repeats fail to take this into account directly. We present here xkcd-repeats, a new family of repeats characterized by the number of different symbols to the left and right of their occurrences. These repeats include maximal and super-maximal repeats as special extreme cases. We give a necessary and sufficient condition for bounding their number linearly in the size of the sequence, and show an optimal algorithm that, given a suffix array, computes them in linear time independent of the size of the alphabet, as well as two other algorithms that are faster in practice. Additionally, we provide an independent and general framework that allows these and other repeats to be computed incrementally, extending the application space of repeats to streaming settings.
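To make the notion of context diversity concrete, the following is a minimal sketch, not the linear-time algorithm of the paper. It assumes a repeat qualifies when its occurrences are preceded by at least `min_left` distinct symbols and followed by at least `min_right` distinct symbols; the function name and both parameter names are hypothetical, and the exact definition used in the paper may differ. The sequence boundaries are treated as symbols of their own. With both thresholds set to 2, the result coincides with the maximal repeats.

```python
from collections import defaultdict


def context_diverse_repeats(s, min_left=2, min_right=2):
    """Naive enumeration of repeats with diverse left and right contexts.

    Illustrative only: this enumerates every substring, so it takes
    quadratic time and space. `min_left` and `min_right` are assumed
    thresholds, not notation taken from the paper.
    """
    left = defaultdict(set)    # substring -> symbols seen just before it
    right = defaultdict(set)   # substring -> symbols seen just after it
    count = defaultdict(int)   # substring -> number of occurrences
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = s[i:j]
            count[w] += 1
            # None stands for the sequence boundary; it cannot collide
            # with any real character of s.
            left[w].add(s[i - 1] if i > 0 else None)
            right[w].add(s[j] if j < n else None)
    return {
        w: (len(left[w]), len(right[w]))
        for w in count
        if count[w] >= 2
        and len(left[w]) >= min_left
        and len(right[w]) >= min_right
    }


print(context_diverse_repeats("banana"))
```

For `"banana"`, the substrings `"a"` and `"ana"` each have two distinct left and two distinct right contexts, while `"an"` is always followed by `a` and is therefore filtered out. Raising either threshold to 3 leaves no repeats for this input.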
